Goal-Directed MDPs: Models and Algorithms. Mausam, Indian Institute of Technology, Delhi. Joint work with Andrey Kolobov and Dan Weld. (Planning à la Sutton: control, full sequential, model-based, value-based.)


  1.–7. LAO* (worked example)
[Figure: LAO* search graph over states s0–s8 with goal sg; expanded states carry computed values V, fringe states carry heuristic values h; the animation steps through successive expansions.]
Each iteration: FIND: expand some states on the fringe (in the greedy graph); initialize all new states to their heuristic value; subset = all states in the expanded graph that can reach s; perform VI on this subset; recompute the greedy graph.

  8.–9. LAO* (worked example, continued)
Output the greedy graph as the final policy.

  10. LAO*
M#1: some states can be ignored for efficient computation. In the example, s4 was never expanded and s8 was never touched.

  11. LAO* [Hansen & Zilberstein 98]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand the best state s on the fringe (in the greedy graph)   // one expansion per iteration
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  perform VI on this subset   // a lot of computation
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
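To make the loop above concrete, here is a minimal Python sketch of LAO*, assuming a hypothetical `mdp` object exposing `actions(s)`, `transitions(s, a)` (returning `(successor, probability)` pairs), `cost(s, a)`, and `is_goal(s)`, plus an admissible heuristic `h` with `h(goal) = 0`. The FIND step expands a single fringe state, and the VI step runs over the whole expanded part of the greedy graph rather than only the ancestors of the expanded state, so this is an illustration of the structure, not the talk's exact implementation.

```python
def lao_star(mdp, s0, h, epsilon=1e-4):
    """A minimal sketch of LAO* (Hansen & Zilberstein 98) for an SSP MDP."""
    V = {s0: h(s0)}      # value estimates, initialized by the heuristic
    expanded = set()     # states whose successors have been generated

    def q(s, a):
        return mdp.cost(s, a) + sum(p * V[t] for t, p in mdp.transitions(s, a))

    def greedy_action(s):
        return min(mdp.actions(s), key=lambda a: q(s, a))

    def greedy_graph():
        """States reachable from s0 under the current greedy policy."""
        reachable, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if mdp.is_goal(s) or s not in expanded:
                continue                     # goals and fringe states are leaves
            for t, _ in mdp.transitions(s, greedy_action(s)):
                if t not in reachable:
                    reachable.add(t)
                    stack.append(t)
        return reachable

    while True:
        graph = greedy_graph()
        fringe = [s for s in graph if s not in expanded and not mdp.is_goal(s)]
        if not fringe:
            break                            # greedy graph has no fringe: done
        s = fringe[0]                        # FIND: expand a state on the fringe
        expanded.add(s)
        for a in mdp.actions(s):
            for t, _ in mdp.transitions(s, a):
                V.setdefault(t, 0.0 if mdp.is_goal(t) else h(t))
        # VI over the expanded states of the greedy graph (LAO* proper restricts
        # this to the ancestors of s in the expanded graph).
        while True:
            residual = 0.0
            for u in graph | {s}:
                if u in expanded and not mdp.is_goal(u):
                    new_v = min(q(u, a) for a in mdp.actions(u))
                    residual = max(residual, abs(new_v - V[u]))
                    V[u] = new_v
            if residual < epsilon:
                break

    policy = {s: greedy_action(s) for s in greedy_graph() if not mdp.is_goal(s)}
    return V, policy
```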

  12. Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand the best state s on the fringe (in the greedy graph)
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  run VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  13. Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states on the greedy fringe
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  run VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  14. iLAO* [Hansen & Zilberstein 01]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states on the greedy fringe
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  perform only one backup per state in the greedy graph (in what order? DFS postorder, fringe → start)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  15. Real Time Dynamic Programming [Barto et al 95]
• Original motivation: an agent acting in the real world.
• Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state; stop when you hit the goal.
• RTDP: repeat trials forever. No termination condition! Converges in the limit as #trials → ∞.
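The backup that each trial performs is the standard SSP Bellman backup. A small sketch, using the same hypothetical `mdp` interface as above (`actions`, `transitions`, `cost`, `is_goal`) and a heuristic `h` for states not yet seen:

```python
def bellman_backup(mdp, V, h, s):
    """One Bellman backup at state s for an SSP (minimize expected cost-to-goal).

    V is a dict of value estimates; states not yet seen fall back to the
    heuristic h.  Returns the greedy action, which an RTDP trial uses to pick
    the next state to simulate.
    """
    if mdp.is_goal(s):
        V[s] = 0.0
        return None
    best_a, best_q = None, float("inf")
    for a in mdp.actions(s):
        q = mdp.cost(s, a) + sum(p * V.get(t, h(t)) for t, p in mdp.transitions(s, a))
        if q < best_q:
            best_a, best_q = a, q
    V[s] = best_q        # in-place value update: the "backup"
    return best_a
```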

  16.–22. Trial (worked example)
[Figure: the same state graph (s0–s8, goal sg); along the simulated trajectory, visited states' heuristic values h are replaced by backed-up values V.]
start at the start state
repeat
  perform a Bellman backup
  simulate the greedy action
until you hit the goal

  23. Trial → RTDP
RTDP: repeat trials forever (each trial: start at the start state; repeat a Bellman backup and simulate the greedy action until you hit the goal).

  24. RTDP Family of Algorithms
repeat
  s ← s0
  repeat   // trials
    REVISE s; identify a_greedy
    FIND: pick s' s.t. T(s, a_greedy, s') > 0
    s ← s'
  until s ∈ G
until termination test
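A hedged Python sketch of this trial loop, again assuming the hypothetical `mdp` interface used earlier; REVISE is a Bellman backup, FIND samples a successor of the greedy action, and the termination test is replaced here by a simple trial budget (labeled variants, below, replace it with a convergence test).

```python
import random

def rtdp(mdp, s0, h, num_trials=1000, max_trial_len=10_000):
    """Sketch of the RTDP family loop: repeated trials of backup + greedy simulation."""
    V = {}

    def revise(s):
        """Bellman backup at s; returns the greedy action."""
        qs = {a: mdp.cost(s, a) + sum(p * V.get(t, h(t))
                                      for t, p in mdp.transitions(s, a))
              for a in mdp.actions(s)}
        a_greedy = min(qs, key=qs.get)
        V[s] = qs[a_greedy]
        return a_greedy

    for _ in range(num_trials):              # "repeat trials" (here: a fixed budget)
        s, steps = s0, 0
        while not mdp.is_goal(s) and steps < max_trial_len:
            a_greedy = revise(s)                         # REVISE s
            succs, probs = zip(*mdp.transitions(s, a_greedy))
            s = random.choices(succs, weights=probs)[0]  # FIND: sample s' ~ T(s, a_greedy, .)
            steps += 1
    return V
```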

  25. Termination Test: Labeling
• Admissible heuristic ⇒ V(s) ≤ V*(s) ⇒ Q(s,a) ≤ Q*(s,a)
• Label a state s as solved if V(s) has converged:
  Res_V(s) < ε ⇒ V(s) won't change ⇒ label s as solved
[Figure: s reaches the goal sg via its best action.]

  26. Labeling (contd)
Res_V(s) < ε and s' already solved ⇒ V(s) won't change ⇒ label s as solved
[Figure: s reaches s' and the goal sg via its best action; s' is already solved.]

  27. Labeling (contd)
M#3: some algorithms use explicit knowledge of goals. M#1: some states can be ignored for efficient computation.
• Left case: Res_V(s) < ε and s' already solved ⇒ V(s) won't change ⇒ label s as solved.
• Right case: Res_V(s) < ε and Res_V(s') < ε ⇒ V(s), V(s') won't change ⇒ label s, s' as solved.
[Figure: two copies of the example, each with s reaching s' and the goal sg via best actions.]

  28. Labeled RTDP [Bonet & Geffner 03b]
label all goal states as solved
repeat
  s ← s0
  repeat   // trials
    REVISE s; identify a_greedy
    FIND: sample s' from T(s, a_greedy, ·)
    s ← s'
  until s is solved
  for all states s in the trial: try to label s as solved
until s0 is solved
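A simplified sketch of the labeling test from slides 25–27, assuming the same hypothetical `mdp` interface; note that the actual CheckSolved procedure of Bonet and Geffner explores the whole greedy envelope below `s`, while this per-state version only inspects `s` and its greedy successors.

```python
def try_label_solved(mdp, V, h, solved, s, epsilon=1e-4):
    """Label s as solved if its Bellman residual is below epsilon and every
    successor under the greedy action is a goal or already labeled solved.

    This is a per-state approximation of the full CheckSolved procedure.
    """
    if mdp.is_goal(s):
        solved.add(s)
        return True
    qs = {a: mdp.cost(s, a) + sum(p * V.get(t, h(t))
                                  for t, p in mdp.transitions(s, a))
          for a in mdp.actions(s)}
    a_greedy = min(qs, key=qs.get)
    residual = abs(qs[a_greedy] - V.get(s, h(s)))
    succs_ok = all(mdp.is_goal(t) or t in solved
                   for t, _ in mdp.transitions(s, a_greedy))
    if residual < epsilon and succs_ok:
        solved.add(s)
        return True
    return False
```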

  29. LRTDP
• Terminates in finite time, due to the labeling procedure.
• Anytime: focuses attention on more probable states.
• Fast convergence: focuses attention on unconverged states.

  30. LRTDP Extensions
• Different ways to pick the next state; different termination conditions.
• Bounded RTDP [McMahan et al 05]
• Focused RTDP [Smith & Simmons 06]
• Value of Perfect Information RTDP [Sanner et al 09]

  31. Where do Heuristics come from?
• Domain-dependent heuristics.
• Domain-independent heuristics, dependent on the specific domain representation.
M#2: factored representations expose useful problem structure.

  32. Take-Homes
• Efficient computation given a start state s0: heuristic search.
• Automatic computation of heuristics, in a domain-independent manner.

  33. Shameless Plug

  34. Agenda
• Background: Stochastic Shortest Path MDPs
• Background: Heuristic Search for SSP MDPs
• Algorithms: Automatic Basis Function Discovery
• Models: SSPs → Generalized SSPs

  35. Previous Work vs. Our Work
Previous work:
• Determinization: determinize the MDP; classical planners are fast (e.g., FF-Replan); cons: may be troubled by complex contingencies and probabilities.
• Function approximation: dimensionality reduction; represent state values with basis functions, e.g., V*(s) ≈ Σ_i w_i b_i(s); cons: need a human to provide the b_i.
Our work: marry these paradigms to extract problem-specific structure in a fast, problem-independent way.
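For reference, the linear value approximation written above, V*(s) ≈ Σ_i w_i b_i(s), in a few lines of Python; the features and weights in the usage example are made up.

```python
def approx_value(state, basis_functions, weights):
    """Linear value approximation V(s) ≈ sum_i w_i * b_i(s).

    basis_functions is a list of callables b_i(state) -> float (often 0/1
    indicator features); weights is a parallel list of learned weights.
    """
    return sum(w * b(state) for w, b in zip(weights, basis_functions))

# Hypothetical usage: two indicator features over a state represented as a set
# of true propositions, with made-up weights.
state = {"has_hammer", "at_workshop"}
basis = [lambda s: 1.0 if "has_hammer" in s else 0.0,
         lambda s: 1.0 if "at_workshop" in s else 0.0]
print(approx_value(state, basis, weights=[2.5, 1.0]))   # -> 3.5
```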

  36. Example Domain
[Figure: example domain with actions GetH, GetS, GetW.]

  37. Example Domain (cont'd)
[Figure: example domain (cont'd) with actions Tweak and Smash.]

  38. SSP_s0 MDP
• S: a set of states
• A: a set of actions (here: GetW, GetH, GetS, Tweak, Smash)
• T(s, a, s'): transition model
• C(s, a, s'): action cost
• s0: start state
• G: set of goals
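One way this tuple might be packaged in code; the container and the example fragment below are illustrative only (the dynamics are placeholders, not the talk's domain).

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]          # a state as a set of true propositions
Action = str

@dataclass
class SSP:
    """The SSP_s0 tuple <S, A, T, C, s0, G> from slide 38.

    T maps (s, a) to a list of (successor, probability) pairs; C gives the
    action cost.  S is left implicit (states are enumerated on demand).
    """
    actions: Callable[[State], List[Action]]
    T: Callable[[State, Action], List[Tuple[State, float]]]
    C: Callable[[State, Action], float]
    s0: State
    goals: Callable[[State], bool]

# Hypothetical fragment with the example actions, just to show the shape:
s0 = frozenset()
ssp = SSP(
    actions=lambda s: ["GetW", "GetH", "GetS", "Tweak", "Smash"],
    T=lambda s, a: [(s | {a.lower()}, 1.0)],      # placeholder dynamics
    C=lambda s, a: 1.0,
    s0=s0,
    goals=lambda s: "smash" in s,                 # placeholder goal test
)
```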

  39. Contributions
ReTrASE: a scalable approximate MDP solver.
• Combines function approximation with classical planning.
• Uses a classical planner to automatically generate basis functions.
• Fast, memory-efficient, high-quality policies.

  40. The Big Picture: ReTrASE [Kolobov, Mausam, Weld, AIJ'12]
[Architecture diagram: the MDP P is determinized into Det(P). For a state s, a classical planner is run on Det(P); it returns either a trajectory, which is regressed into basis functions, or a dead end, which is fed to SixthSense to produce nogoods. Basis functions and nogoods are used to evaluate s (Value(s)), and a state-space exploration routine (e.g., RTDP) uses these evaluations to produce the policy.]

  41. Determinizing the Domain
[Figure: a probabilistic action with outcomes of probability 9/10 and 1/10 is split into deterministic actions.]
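A small sketch of one common determinization, the all-outcomes variant, in which every probabilistic outcome becomes its own deterministic action and the probabilities are dropped; the action encoding (outcome lists of probability/effect pairs) and the example action are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Effect = Dict[str, bool]                       # proposition -> new truth value
ProbAction = List[Tuple[float, Effect]]        # outcomes: (probability, effect)

def all_outcomes_determinization(actions: Dict[str, ProbAction]) -> Dict[str, Effect]:
    """Split every probabilistic action into one deterministic action per outcome.

    E.g., an action with outcomes of probability 9/10 and 1/10 becomes two
    deterministic actions; the probabilities are simply discarded.
    """
    det_actions: Dict[str, Effect] = {}
    for name, outcomes in actions.items():
        for i, (_prob, effect) in enumerate(outcomes):
            det_actions[f"{name}_outcome{i}"] = effect     # probability dropped
    return det_actions

# Hypothetical probabilistic action with a 9/10 and a 1/10 outcome:
prob_actions = {"GetW": [(0.9, {"have_w": True}), (0.1, {"broken": True})]}
print(all_outcomes_determinization(prob_actions))
```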

  42. Generating Trajectories
[The ReTrASE architecture diagram from slide 40, repeated; this step runs the classical planner on Det(P) from state s to obtain a trajectory.]

  43. Generating Trajectories

  44. Computing Basis Functions
[The ReTrASE architecture diagram from slide 40, repeated; this step regresses the planner's trajectories into basis functions.]

  45. Regressing Trajectories
A basis function guarantees that the goal is reachable from s.
[Figure: example trajectories regressed into basis functions, with initial weights (shown as = 1 and = 2).]
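The slide does not spell out the regression step, but classical goal regression through a deterministic plan gives the flavor: regressing the goal backwards through the trajectory yields a conjunction of literals, and any state satisfying that conjunction can reach the goal by executing the plan. A sketch with STRIPS-style actions (names and the example plan are hypothetical).

```python
from typing import FrozenSet, List, NamedTuple, Optional

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # precondition literals
    add: FrozenSet[str]     # add effects
    dele: FrozenSet[str]    # delete effects

def regress_trajectory(goal: FrozenSet[str],
                       plan: List[Action]) -> Optional[FrozenSet[str]]:
    """Regress a goal backwards through a deterministic plan (goal regression).

    The result is a conjunction of literals: any state satisfying it can reach
    the goal by executing the plan, i.e., a basis function in the ReTrASE
    sense (illustrative reconstruction, not the talk's exact procedure).
    """
    current = goal
    for a in reversed(plan):
        if a.dele & current:          # action would destroy a needed literal
            return None
        current = (current - a.add) | a.pre
    return current

# Hypothetical two-step plan:
get_h = Action("GetH", pre=frozenset({"at_store"}),
               add=frozenset({"have_hammer"}), dele=frozenset())
smash = Action("Smash", pre=frozenset({"have_hammer"}),
               add=frozenset({"done"}), dele=frozenset())
bf = regress_trajectory(frozenset({"done"}), [get_h, smash])
print(bf)   # frozenset({'at_store'}): holds in any state from which the plan works
```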

  46. Basis Functions

  47. Computing Values
[The ReTrASE architecture diagram from slide 40, repeated; this step evaluates state s from the basis functions to obtain Value(s).]

  48. Meaning of Basis Function Weights
Want to compute basis function weights so that the blue basis function looks "better" than the pink one!

  49. Value of a Basis Function
• A basis function enables at least one trajectory, applicable from all relevant states.
• Trajectories combine to form policies.
• Value of a basis function ~ "quality" of its policies.
• Algorithm based on RTDP: learn basis function values; use them to compute values of states.
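A sketch of how such basis-function values might be turned into state values; aggregating by the best weight among the basis functions that hold in the state, and treating uncovered states as likely dead ends, is one plausible scheme consistent with the slides, not necessarily ReTrASE's exact formula.

```python
from typing import Dict, FrozenSet

DEAD_END_COST = 1e6        # large penalty for states no basis function covers

def state_value(state: FrozenSet[str],
                bf_weights: Dict[FrozenSet[str], float]) -> float:
    """Approximate the cost-to-goal of a state from basis-function weights.

    A basis function (a conjunction of literals) 'holds' in a state if all of
    its literals are true there; the state's value is the best (lowest) weight
    among the basis functions that hold.
    """
    applicable = [w for bf, w in bf_weights.items() if bf <= state]
    return min(applicable) if applicable else DEAD_END_COST

# Hypothetical basis functions with learned weights:
weights = {frozenset({"at_store"}): 2.0,
           frozenset({"have_hammer"}): 1.0}
print(state_value(frozenset({"at_store"}), weights))                  # 2.0
print(state_value(frozenset({"at_store", "have_hammer"}), weights))   # 1.0
print(state_value(frozenset({"lost"}), weights))                      # 1e6 (treated as a dead end)
```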

  50. Experimental Results
• Criteria: scalability (vs. VI/RTDP-based planners); solution quality (vs. IPPC winners).
• Domains: 6 from IPPC-06 and IPPC-08.
• Competitors: best performer on the particular domain; best performer in the particular IPPC; LRTDP.

  51. The Big Picture
• ReTrASE is vastly more scalable than VI/RTDP-based planners.
• ReTrASE typically rivals or outperforms the best-performing planners on IPPC goal-oriented domains.

  52. Triangle-Tire: Memory Consumption
[Plot: log10 of memory consumption vs. Triangle-Tire problem number, comparing LRTDP_OPT, LRTDP_FF, and ReTrASE.]

  53. Triangle-Tire: Success Rate
[Plot: % of successful trials vs. problem number in Triangle-Tire World'08, comparing ReTrASE, HMDPP, and RFF-PG.]

  54. Exploding Blocks World: Success Rate
[Plot: % of successful trials vs. problem number in Exploding Blocks World'06, comparing ReTrASE, FFReplan, and FPG; the largest instances have ~2^800 states.]

  55. SSP_s0 MDP
• S: a set of states
• A: a set of actions
• T(s, a, s'): transition model
• C(s, a, s'): action cost
• G: set of goals
• s0: start state
Under two conditions:
• There is a proper policy (one that reaches a goal with P = 1 from all states).
• Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1.

  56. Key Drawback of ReTrASE
Dead-end handling is expensive:
• expensive to identify: drain on time
• too many to store: drain on space

  57. Computing Values
[The ReTrASE architecture diagram from slide 40, repeated.]

  58. Research Question
Can we devise a sound dead-end identification procedure fast enough to obviate memoization?
SixthSense: learns feature combinations whose presence guarantees that a state is a dead end.

  59. Nogoods
[Figure: an example nogood.]

  60. Generate-and-Test Procedure
• Generate a nogood candidate
  – Key insight: a nogood is a conjunction that defeats all basis functions
  – For each basis function, pick a literal that defeats it
• Test the candidate
  – Needed for soundness, since we don't know all basis functions
  – Use the non-relaxed planning-graph algorithm
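A sketch of the generate step as listed above, with basis functions encoded as conjunctions of positive literals and the nogood candidate as a set of negated literals; the soundness test is deliberately omitted, since it needs the planning-graph machinery mentioned on the slide.

```python
from typing import FrozenSet, Iterable, Set

def generate_nogood_candidate(basis_functions: Iterable[FrozenSet[str]]) -> Set[str]:
    """Generate step of SixthSense's generate-and-test procedure, sketched.

    Here each basis function is a conjunction of positive literals, and the
    candidate is a set of negated literals ('not_p'): negating one literal
    from each known basis function defeats it.  The candidate must still be
    tested for soundness, because not all basis functions are known.
    """
    candidate: Set[str] = set()
    for bf in basis_functions:
        if any(("not_" + lit) in candidate for lit in bf):
            continue                       # this basis function is already defeated
        lit = sorted(bf)[0]                # pick some literal of the basis function
        candidate.add("not_" + lit)        # ...and assert its negation
    return candidate

# Hypothetical basis functions from the running example:
bfs = [frozenset({"at_store"}), frozenset({"have_hammer", "at_home"})]
print(generate_nogood_candidate(bfs))
# {'not_at_store', 'not_at_home'}: any state satisfying this conjunction
# defeats both known basis functions and is a *candidate* dead end.
```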

  61. Benefits of SixthSense
• Can act as a submodule of many planners and identify dead ends, by checking discovered nogoods against every state.
