LAO*
[Figure, animated over several slides: LAO* run on an example SSP with states s0–s8 and goal Sg; as states are expanded and backed up, their heuristic values h are replaced by computed values V.]
FIND: expand some states on the fringe (in the greedy graph)
initialize all new states by their heuristic value
subset = all states in the expanded graph that can reach s
perform VI on this subset
recompute the greedy graph
output the greedy graph as the final policy
M#1: some states can be ignored for efficient computation — s4 was never expanded; s8 was never touched
LAO* [Hansen & Zilberstein 98]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand best state s on the fringe (in the greedy graph)   ← one expansion
  initialize all new states by their heuristic value
  subset = all states in the expanded graph that can reach s
  perform VI on this subset                                       ← lots of computation
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
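To make the loop above concrete, here is a minimal Python sketch of LAO* (not the authors' code). The MDP is supplied through illustrative callbacks: actions(s), transition(s, a) → list of (s', prob), cost(s, a, s'), is_goal(s), and an admissible heuristic h(s). The "best fringe state" choice and the "VI on the ancestor subset" step are deliberately simplified, as noted in the comments.

```python
from collections import deque

def lao_star(s0, actions, transition, cost, is_goal, h, eps=1e-4):
    """Minimal LAO* sketch; helper interfaces are illustrative, not a library API."""
    V = {s0: h(s0)}                       # value estimates, seeded by the heuristic
    expanded = set()                      # states whose successors have been generated

    def q(s, a):
        return sum(p * (cost(s, a, s2) + V[s2]) for s2, p in transition(s, a))

    def greedy_action(s):
        return min(actions(s), key=lambda a: q(s, a))

    def greedy_graph():
        """States reachable from s0 by following greedy actions within the expanded graph."""
        seen, queue = {s0}, deque([s0])
        while queue:
            s = queue.popleft()
            if is_goal(s) or s not in expanded:
                continue                  # fringe/goal states stay in the graph but aren't followed
            for s2, _ in transition(s, greedy_action(s)):
                if s2 not in seen:
                    seen.add(s2); queue.append(s2)
        return seen

    while True:
        fringe = [s for s in greedy_graph() if s not in expanded and not is_goal(s)]
        if not fringe:
            break
        s = fringe[0]                     # "best" fringe state; any priority scheme works here
        expanded.add(s)
        for a in actions(s):              # initialize all new successors by their heuristic value
            for s2, _ in transition(s, a):
                V.setdefault(s2, h(s2))
        # VI restricted to expanded states (the real algorithm uses only ancestors of s)
        while True:
            residual = 0.0
            for x in expanded:
                new = 0.0 if is_goal(x) else q(x, greedy_action(x))
                residual = max(residual, abs(new - V[x]))
                V[x] = new
            if residual < eps:
                break
    return {s: greedy_action(s) for s in greedy_graph()
            if s in expanded and not is_goal(s)}
```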
Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand best state s on the fringe (in the greedy graph)
  initialize all new states by their heuristic value
  subset = all states in the expanded graph that can reach s
  VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states in the greedy fringe
  initialize all new states by their heuristic value
  subset = all states in the expanded graph that can reach the expanded states
  VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
iLAO* [Hansen & Zilberstein 01]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states in the greedy fringe
  initialize all new states by their heuristic value
  subset = all states in the expanded graph that can reach s
  perform only one backup per state in the greedy graph
    (in what order? DFS postorder, starting from the fringe)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
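iLAO* replaces the inner VI with a single backup per greedy-graph state, performed in DFS postorder so that successors are updated before their ancestors. A small sketch, reusing the illustrative q/greedy_action/transition helpers from the LAO* sketch above:

```python
def one_backup_pass(s0, V, expanded, transition, greedy_action, q, is_goal):
    """Back up each state in the greedy graph exactly once, in DFS postorder."""
    order, seen = [], set()

    def dfs(s):
        seen.add(s)
        if s in expanded and not is_goal(s):
            for s2, _ in transition(s, greedy_action(s)):
                if s2 not in seen:
                    dfs(s2)
        order.append(s)              # postorder: children are appended before the parent

    dfs(s0)
    for s in order:                  # so successors are updated before their ancestors
        if s in expanded and not is_goal(s):
            V[s] = q(s, greedy_action(s))
```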
Real Time Dynamic Programming [Barto et al 95]
• Original motivation
  – agent acting in the real world
• Trial
  – simulate the greedy policy starting from the start state
  – perform a Bellman backup on each visited state
  – stop when you hit the goal
• RTDP: repeat trials forever — no termination condition!
  – converges in the limit as #trials → ∞
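A minimal sketch of one RTDP trial as described above, using the same illustrative callback interface as the earlier sketches: the greedy action is backed up at each visited state and the next state is sampled from the transition model.

```python
import random

def rtdp_trial(s0, actions, transition, cost, is_goal, h, V):
    """One RTDP trial: back up and act greedily until a goal is reached."""
    def q(s, a):
        return sum(p * (cost(s, a, s2) + V.setdefault(s2, h(s2)))
                   for s2, p in transition(s, a))

    s = s0
    V.setdefault(s, h(s))
    while not is_goal(s):
        a = min(actions(s), key=lambda act: q(s, act))   # greedy action
        V[s] = q(s, a)                                   # Bellman backup on the visited state
        succs, probs = zip(*[(s2, p) for s2, p in transition(s, a)])
        s = random.choices(succs, weights=probs)[0]      # simulate the greedy action
    return V

# RTDP itself just repeats trials forever (converging in the limit):
# V = {}
# while True:
#     rtdp_trial(s0, actions, transition, cost, is_goal, h, V)
```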
Trial
[Figure, animated over several slides: a simulated trial on the example SSP with states s0–s8 and goal Sg; visited states have their heuristic values h replaced by backed-up values V.]
start at the start state
repeat
  perform a Bellman backup
  simulate the greedy action
until you hit the goal
RTDP: repeat such trials forever
RTDP Family of Algorithms
repeat
  s ← s0
  repeat // trials
    REVISE s; identify a_greedy
    FIND: pick s' s.t. T(s, a_greedy, s') > 0
    s ← s'
  until s ∈ G
until termination test
Termination Test: Labeling
• Admissible heuristic ⇒ V(s) ≤ V*(s) ⇒ Q(s,a) ≤ Q*(s,a)
• Label a state s as solved
  – if V(s) has converged
[Figure: state s connected to goal sg via its best action]
Res^V(s) < ε ⇒ V(s) won't change! ⇒ label s as solved
Labeling (contd)
[Figure: state s whose best action leads to goal sg and to s']
Res^V(s) < ε and s' already solved ⇒ V(s) won't change! ⇒ label s as solved
Labeling (contd)
M#3: some algorithms use explicit knowledge of goals
M#1: some states can be ignored for efficient computation
[Figure: two cases — left: s with best action to sg and to an already-solved s'; right: s and s' with best actions toward sg]
Left: Res^V(s) < ε and s' already solved ⇒ V(s) won't change! ⇒ label s as solved
Right: Res^V(s) < ε and Res^V(s') < ε ⇒ V(s), V(s') won't change! ⇒ label s, s' as solved
Labeled RTDP [Bonet & Geffner 03b]
repeat
  s ← s0
  label all goal states as solved
  repeat // trials
    REVISE s; identify a_greedy
    FIND: sample s' from T(s, a_greedy, s')
    s ← s'
  until s is solved
  for all states s in the trial
    try to label s as solved
until s0 is solved
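The labeling step can be sketched roughly as below. This is a simplified stand-in for the paper's CheckSolved procedure, assuming the same illustrative helpers as before: it explores the greedy envelope of s and marks every state in it solved only if all residuals are below ε.

```python
def try_label_solved(s, V, solved, transition, greedy_action, q, is_goal, eps=1e-4):
    """Simplified labeling check: explore the greedy envelope of s; if every
    reachable, not-yet-solved state has a low Bellman residual, mark them all solved."""
    if s in solved or is_goal(s):
        return True
    envelope, stack, seen = [], [s], {s}
    converged = True
    while stack:
        x = stack.pop()
        if x in solved or is_goal(x):
            continue
        envelope.append(x)
        a = greedy_action(x)
        if abs(q(x, a) - V[x]) >= eps:        # high residual: not converged yet
            converged = False
        for x2, _ in transition(x, a):
            if x2 not in seen:
                seen.add(x2); stack.append(x2)
    if converged:
        solved.update(envelope)               # V won't change for these states
    else:
        for x in envelope:                    # otherwise, back these states up again
            V[x] = q(x, greedy_action(x))
    return converged
```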
LRTDP
• terminates in finite time
  – due to the labeling procedure
• anytime
  – focuses attention on more probable states
• fast convergence
  – focuses attention on unconverged states
LRTDP Extensions
• Different ways to pick the next state
• Different termination conditions
• Bounded RTDP [McMahan et al 05]
• Focused RTDP [Smith & Simmons 06]
• Value of Perfect Information RTDP [Sanner et al 09]
Where do Heuristics come from?
• Domain-dependent heuristics
• Domain-independent heuristics
  – dependent on the specific domain representation
M#2: factored representations expose useful problem structure
Take-Homes
• efficient computation given start state s0
  – heuristic search
• automatic computation of heuristics
  – in a domain-independent manner
Shameless Plug
Agenda
• Background: Stochastic Shortest Path MDPs
• Background: Heuristic Search for SSP MDPs
• Algorithms: Automatic Basis Function Discovery
• Models: SSPs → Generalized SSPs
Previous Work | Our Work
• Previous work: Determinization
  – determinize the MDP
  – classical planners are fast
  – e.g., FF-Replan
  – cons: may be troubled by complex contingencies, probabilities
• Our work: Function Approximation
  – dimensionality reduction
  – represent state values with basis functions, e.g., V*(s) ≈ ∑_i w_i b_i(s)
  – cons: need a human to get the b_i
Marry these paradigms to extract problem-specific structure in a fast, problem-independent way.
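For concreteness, the linear value approximation V*(s) ≈ ∑_i w_i b_i(s) evaluates a state as a weighted sum of basis-function features. A tiny illustrative sketch; the particular features and weights are made up:

```python
def approx_value(state, basis_functions, weights):
    """Linear value approximation: V(s) ≈ sum_i w_i * b_i(s)."""
    return sum(w * b(state) for b, w in zip(basis_functions, weights))

# Hypothetical example: two hand-written features of a state
basis = [lambda s: 1.0,                                   # bias feature
         lambda s: float(s.get("holding_key", False))]    # indicator feature
weights = [5.0, -3.0]                                     # made-up weights
print(approx_value({"holding_key": True}, basis, weights))   # -> 2.0
```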
Example Domain
[Figure: example domain with actions GetH, GetS, GetW]
Example Domain (cont'd)
[Figure: example domain with actions Tweak, Smash]
SSP_s0 MDP
• S: a set of states
• A: a set of actions (here: GetW, GetH, GetS, Tweak, Smash)
• T(s,a,s'): transition model
• C(s,a,s'): action cost
• s0: start state
• G: set of goals
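Purely as an illustration, the components of this tuple can be packaged as a small container; the field names below simply mirror the slide and are not part of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List, Set, Tuple

State = Hashable
Action = str

@dataclass
class SSPs0MDP:
    """An SSP MDP with a designated start state (illustrative container)."""
    states: Set[State]                                                 # S
    actions: Callable[[State], List[Action]]                           # A (applicable actions)
    transition: Callable[[State, Action], List[Tuple[State, float]]]   # T(s,a,.)
    cost: Callable[[State, Action, State], float]                      # C(s,a,s')
    start: State                                                       # s0
    goals: Set[State]                                                  # G
```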
Contributions
ReTrASE — a scalable approximate MDP solver
– combines function approximation with classical planning
– uses a classical planner to automatically generate basis functions
– fast, memory-efficient, high-quality policies
The Big Picture: ReTrASE [Kolobov, Mausam, Weld, AIJ'12]
[Architecture diagram — Extraction Module: the MDP P is determinized into Det(P); for a state s, a classical planner is run on Det(P) and returns either a trajectory (which is regressed into basis functions) or a dead end (which is passed to SixthSense to produce nogoods). A state-space exploration routine (e.g., RTDP) evaluates states against the basis functions and nogoods to obtain Value(s), yielding the policy.]
Determinizing the Domain
[Figure: an example action with outcome probabilities 9/10 and 1/10, and its determinized counterparts]
Generating Trajectories
[ReTrASE architecture diagram repeated; this step runs the classical planner on Det(P) to obtain a trajectory.]
[Figure: an example trajectory found in the determinized domain]
Computing Basis Functions
[ReTrASE architecture diagram repeated; this step regresses the trajectory to obtain basis functions.]
Regressing Trajectories
[Figure: trajectories regressed into basis functions; a basis function guarantees that the goal is reachable from s. Initial weights of the two basis functions shown: 1 and 2.]
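A rough sketch of regressing a deterministic trajectory to obtain basis functions: working backwards from the goal, each action's achieved literals are removed and its preconditions added, so the conjunction at each step is a condition from which the remaining plan suffix reaches the goal. The STRIPS-style action dicts and the tiny two-step plan below are made up for illustration.

```python
def regress_trajectory(goal_literals, plan):
    """Regress a goal through a sequence of deterministic (STRIPS-like) actions.
    Each action is a dict with 'pre', 'add', 'del' literal sets.
    Returns one basis function (a literal conjunction) per plan suffix.
    (A full regression would also check that cond does not intersect 'del'.)"""
    basis_functions = []
    cond = set(goal_literals)
    for action in reversed(plan):
        # literals achieved by the action are no longer required before it;
        # its preconditions are
        cond = (cond - action["add"]) | action["pre"]
        basis_functions.append(frozenset(cond))
    return basis_functions

# Hypothetical 2-step plan: GetW then Tweak achieves the goal {done}
plan = [{"pre": {"at_w"}, "add": {"have_w"}, "del": set()},
        {"pre": {"have_w"}, "add": {"done"}, "del": set()}]
print(regress_trajectory({"done"}, plan))
# -> [frozenset({'have_w'}), frozenset({'at_w'})]
```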
Basis Functions
[Figure: the basis functions extracted for the example domain]
Computing Values
[ReTrASE architecture diagram repeated; this step evaluates state s against the basis functions to obtain Value(s).]
Meaning of Basis Function Weights
Want to compute basis function weights so that the blue basis function looks "better" than the pink one!
Value of a Basis Function
• A basis function enables at least one trajectory
  – applicable from all relevant states
• Trajectories combine to form policies
• Value of a basis function ~ "quality" of its policies
• Algorithm based on RTDP
  – learn basis function values
  – use them to compute values of states (see the sketch below)
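One plausible way to turn learned basis-function values into state values, matching the "learn basis function values, use them to compute values of states" bullets above. The min-aggregation shown is an assumption of this sketch, not necessarily ReTrASE's exact rule; `bf_value` holds the learned per-basis-function values.

```python
def state_value(state, basis_functions, bf_value, dead_end_cost=1e6):
    """Value of a state from the basis functions that 'fire' in it.

    A basis function b (a literal conjunction) is applicable in `state` if all
    its literals hold there; its learned value bf_value[b] estimates the
    cost-to-go of the trajectories/policies it enables.
    """
    applicable = [b for b in basis_functions if b <= state]   # b is a frozenset of literals
    if not applicable:
        return dead_end_cost             # no known way to reach the goal from here
    return min(bf_value[b] for b in applicable)

# Hypothetical usage with the basis functions from the regression sketch:
bfs = [frozenset({"have_w"}), frozenset({"at_w"})]
values = {bfs[0]: 1.0, bfs[1]: 2.0}
print(state_value(frozenset({"at_w", "raining"}), bfs, values))   # -> 2.0
```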
Experimental Results
• Criteria:
  – scalability (vs. VI/RTDP-based planners)
  – solution quality (vs. IPPC winners)
• Domains: 6 from IPPC-06 and IPPC-08
• Competitors:
  – best performer on the particular domain
  – best performer in the particular IPPC
  – LRTDP
The Big Picture
• ReTrASE is vastly more scalable than VI/RTDP-based planners
• ReTrASE typically rivals or outperforms the best-performing planners on IPPC goal-oriented domains
Triangle-Tire: Memory Consumption
[Plot: log10(amount of memory) vs. Triangle-Tire problem #, comparing LRTDP_OPT, LRTDP_FF, and ReTrASE]
Triangle-Tire: Success Rate
[Plot: % of successful trials vs. problem # on Triangle-Tire World '08, comparing ReTrASE, HMDPP, and RFF-PG]
Exploding Blocks World: Success Rate
[Plot: % of successful trials vs. problem # on Exploding Blocks World '06 (~2^800 states!), comparing ReTrASE, FFReplan, and FPG]
SSP_s0 MDP
• S: a set of states
• A: a set of actions
• T(s,a,s'): transition model
• C(s,a,s'): cost
• s0: start state
• G: set of goals
Under two conditions:
• there is a proper policy (reaches a goal with P = 1 from all states)
• every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1
Key Drawback of ReTrASE …
• Dead-end handling is expensive
  – expensive to identify: a drain on time
  – too many to store: a drain on space
Computing Values
[ReTrASE architecture diagram repeated; dead ends found by the classical planner are passed to SixthSense, which produces nogoods.]
Research Question
Can we devise a sound dead-end identification procedure fast enough to obviate memoization?
Learns feature combinations whose presence guarantees that a state is a dead end
Nogoods
[Figure: an example nogood — a conjunction of literals whose presence makes a state a dead end]
Generate-and-Test Procedure
• Generate a nogood candidate
  – key insight: a nogood is a conjunction that defeats all basis functions
  – for each basis function, pick a literal that defeats it
• Test the candidate
  – needed for soundness, since we don't know all basis functions
  – use the non-relaxed planning-graph algorithm
(A rough sketch of this generate-and-test loop follows below.)
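A minimal sketch of that loop, under simplifying assumptions: a state is a set of literals, a basis function is "defeated" in a state when one of its required literals is absent, and the soundness test is a conservative stub standing in for the non-relaxed planning-graph check.

```python
def generate_nogood_candidate(state, basis_functions):
    """For each basis function, pick one literal it needs that the state lacks.
    The conjunction of those absences defeats every known basis function."""
    candidate = set()
    for bf in basis_functions:                  # bf is a frozenset of required literals
        missing = [lit for lit in bf if lit not in state]
        if not missing:
            return None                         # some basis function applies: not a candidate dead end
        candidate.add(("absent", missing[0]))   # record "this literal is absent" as a feature
    return frozenset(candidate)

def is_sound_nogood(candidate):
    """Stub for the soundness test. The real check runs a non-relaxed planning
    graph to verify that no state matching the candidate can reach the goal;
    here we conservatively reject every candidate."""
    return False

def find_nogood(state, basis_functions):
    cand = generate_nogood_candidate(state, basis_functions)
    if cand is not None and is_sound_nogood(cand):
        return cand                             # remember it; any matching state is a dead end
    return None
```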
Benefits of SixthSense
• Can act as a submodule of many planners and identify dead ends
  – by checking discovered nogoods against every state