


  1. Monte-Carlo Planning: Basic Principles and Recent Progress
     Dan Weld – UW CSE 573, October 2012
     Most slides by Alan Fern, EECS, Oregon State University; a few from me, Dan Klein, Luke Zettlemoyer, etc.

     Logistics 1
      HW 1 – consistency & admissibility
      Correct & resubmit by Mon 10/22 for 50% of missed points

     Logistics 2
      HW2 – due tomorrow evening
      HW3 – due Mon 10/29
        Value iteration
        Understand the terms in the Bellman equation
        Q-learning
        Function approximation & state abstraction

     Logistics 3
      Projects
        Teams (~3 people)
        Ideas

     Stochastic/Probabilistic Planning: Outline
      Recap: Markov Decision Processes
      What is Monte-Carlo Planning?
      Uniform Monte-Carlo
        Single State Case (PAC Bandit)
        Policy rollout
        Sparse Sampling
      Adaptive Monte-Carlo
        Single State Case (UCB Bandit)
        UCT Monte-Carlo Tree Search
      Reinforcement Learning

     Markov Decision Process (MDP) Model
      [Figure: an agent (????) sends actions to the world (possibly stochastic) and receives back a state + reward.]
      We will model the world as an MDP.

  2. Markov Decision Processes
      An MDP has four components: S, A, P_R, P_T:
        finite state set S
        finite action set A
        Transition distribution P_T(s' | s, a)
          Probability of going to state s' after taking action a in state s
          First-order Markovian dynamics (history independence): the next state depends only on the current state and action
        Bounded reward distribution P_R(r | s, a)
          Probability of receiving immediate reward r after executing a in s
          First-order Markovian reward process: the reward depends only on the current state and action

     Graphical View of MDP
      [Figure: dynamic Bayesian network with states S_t, S_{t+1}, S_{t+2}, actions A_t, A_{t+1}, A_{t+2}, and rewards R_t, R_{t+1}, R_{t+2}; each step is a transition (s, a, s').]

     Policies ("plans" for MDPs)
      Given an MDP we wish to compute a policy
        Could be computed offline or online
      A policy is a possibly stochastic mapping from states to actions
        π : S → A
        π(s) is the action to take at state s
        Specifies a continuously reactive controller
      How do we measure the goodness of a policy?

     Recap: Defining MDPs
      Policy, π
        A function that chooses an action for each state
      Value function of a policy (aka utility)
        Sum of discounted rewards from following the policy
      Objective: find the policy which maximizes expected utility, V(s)

     Value Function of a Policy
      We consider finite-horizon discounted reward with discount factor 0 ≤ β < 1
      V^π(s, h) denotes the expected h-horizon discounted total reward of policy π at state s
      Each run of π for h steps produces a random reward sequence R_1, R_2, R_3, ..., R_h
      V^π(s, h) is the expected discounted sum of this sequence (a Monte-Carlo estimate of this quantity is sketched in the code after this page):
          $V^\pi(s, h) = E\left[\, \sum_{t=0}^{h} \beta^t R_t \;\middle|\; \pi, s \right]$
      The optimal policy π* is the policy that achieves maximum value across all states

     Relation to Infinite Horizon Setting
      Often the value function V^π(s) is defined over an infinite horizon for a discount factor 0 ≤ β < 1:
          $V^\pi(s) = E\left[\, \sum_{t=0}^{\infty} \beta^t R_t \;\middle|\; \pi, s \right]$
      It is easy to show that the difference between V^π(s, h) and V^π(s) shrinks exponentially fast as h grows:
          $\max_s \left| V^\pi(s) - V^\pi(s, h) \right| \le \frac{\beta^{h+1}}{1 - \beta} R_{\max}$
      So h-horizon results apply to the infinite-horizon setting
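The finite-horizon value $V^\pi(s, h)$ above is just an expectation over length-h runs of the policy, so it can be approximated by averaging simulated rollouts. The sketch below is illustrative only and not from the slides: the tiny two-state MDP, the fixed policy, and the names P_T, P_R, and estimate_value are all made up for the example.

```python
# Minimal sketch: Monte-Carlo estimate of the h-horizon discounted value
#   V^pi(s, h) = E[ sum_{t=0}^{h} beta^t R_t | pi, s ]
# for a made-up tabular MDP whose transition and reward distributions are
# given explicitly as lists of (outcome, probability) pairs.
import random

P_T = {("s0", "a"): [("s0", 0.5), ("s1", 0.5)],
       ("s1", "a"): [("s1", 1.0)]}
P_R = {("s0", "a"): [(0.0, 0.5), (1.0, 0.5)],
       ("s1", "a"): [(2.0, 1.0)]}

def sample(dist):
    """Draw one outcome from a list of (outcome, probability) pairs."""
    outcomes, probs = zip(*dist)
    return random.choices(outcomes, weights=probs)[0]

def policy(state):
    return "a"  # trivial fixed policy, just for illustration

def estimate_value(s, h, beta=0.9, runs=10_000):
    """Average the discounted return of `runs` length-h rollouts of the policy from s."""
    total = 0.0
    for _ in range(runs):
        state, ret = s, 0.0
        for t in range(h + 1):
            a = policy(state)
            ret += (beta ** t) * sample(P_R[(state, a)])
            state = sample(P_T[(state, a)])
        total += ret
    return total / runs

print(estimate_value("s0", h=20))
```

By the bound in the "Relation to Infinite Horizon Setting" slide above, taking h large enough makes this rollout estimate arbitrarily close to the infinite-horizon value as well.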

  3. Bellman Equations for MDPs
      Richard Bellman (1920–1984)
      [Figure: the Bellman equations defining the optimal action-value Q*(a, s) and the optimal value function.]

     Computing the Best Policy
      The optimal policy maximizes value at each state
      Optimal policies are guaranteed to exist [Howard, 1960]
      When the state and action spaces are small and the MDP is known, we can find an optimal policy in polynomial time
        With value iteration (a code sketch follows this page)
        Or policy iteration
        Both use...?

     Bellman Backup
      [Figure: one Bellman backup computing V_{i+1} from V_i at state s_0, whose actions a_1, a_2, a_3 lead to states s_1, s_2, s_3 with current values V_0 = 0, 1, 2; maxing over actions gives V_1 = 6.5.]

     Computing the Best Policy
      What if...
        the state space is exponentially large?
        the MDP transition & reward models are unknown?

     Large Worlds: Model-Based Approach
      1. Define a language for compactly describing the MDP model, for example:
        Dynamic Bayesian Networks
        Probabilistic STRIPS/PDDL
      2. Design a planning algorithm for that language
      Problem: more often than not, the selected language is inadequate for a particular problem, e.g.
        Problem size blows up
        Fundamental representational shortcoming

     Large Worlds: Monte-Carlo Approach
      Often a simulator of a planning domain is available or can be learned from data
        Even when the domain can't be expressed in an MDP language
      [Example domains pictured: Fire & Emergency Response, Klondike Solitaire]
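To make the Bellman backup concrete, here is a minimal value-iteration sketch on a made-up two-state MDP with known expected rewards and transition probabilities; the dictionaries R and T, the discount factor, and the tolerance are assumptions for the example, not anything from the slides.

```python
# Minimal sketch of value iteration for a small, fully known tabular MDP.
# Each sweep applies the Bellman backup
#   V_{i+1}(s) = max_a [ R(s, a) + beta * sum_{s'} P_T(s' | s, a) * V_i(s') ]
# until the values stop changing by more than a tolerance.

R = {"s0": {"a1": 1.0, "a2": 0.0},          # expected immediate rewards
     "s1": {"a1": 0.0, "a2": 2.0}}
T = {"s0": {"a1": {"s0": 0.9, "s1": 0.1},   # transition probabilities
            "a2": {"s1": 1.0}},
     "s1": {"a1": {"s0": 1.0},
            "a2": {"s1": 1.0}}}

def value_iteration(beta=0.9, tol=1e-6):
    V = {s: 0.0 for s in R}
    while True:
        V_new = {s: max(R[s][a] + beta * sum(p * V[s2] for s2, p in T[s][a].items())
                        for a in R[s])
                 for s in R}
        if max(abs(V_new[s] - V[s]) for s in R) < tol:
            return V_new
        V = V_new

print(value_iteration())
```

This is the setting that the "What if..." questions above break: once the state space is huge or the model is unknown, these explicit R and T tables are unavailable, which is what motivates the Monte-Carlo approach on the following pages.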

  4. Large Worlds: Monte-Carlo Approach
      Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
      [Figure: the planner exchanges actions and (state + reward) with a world simulator standing in for the real world.]
      In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where a model-based planner is applicable.

     Example Domains with Simulators
      Traffic simulators
      Robotics simulators
      Military campaign simulators
      Computer network simulators
      Emergency planning simulators (large-scale disaster and municipal action)
      Sports domains (Madden Football)
      Board games / video games (Go / RTS)

     MDP: Simulation-Based Representation
      A simulation-based representation gives: S, A, R, T:
        finite state set S (generally very large)
        finite action set A
        Stochastic, real-valued, bounded reward function R(s, a) = r
          Stochastically returns a reward r given inputs s and a
          Can be implemented in an arbitrary programming language
        Stochastic transition function T(s, a) = s' (i.e., a simulator)
          Stochastically returns a state s' given inputs s and a
          The probability of returning s' is dictated by Pr(s' | s, a) of the MDP
          T can be implemented in an arbitrary programming language
      (a minimal code sketch of this interface follows this page)

     Slot Machines as MDP?
      [Figure: the agent interacting with a row of slot machines.]

     Outline
      Preliminaries: Markov Decision Processes
      What is Monte-Carlo Planning?
      Uniform Monte-Carlo
        Single State Case (Uniform Bandit)
        Policy rollout
        Sparse Sampling
      Adaptive Monte-Carlo
        Single State Case (UCB Bandit)
        UCT Monte-Carlo Tree Search

     Single State Monte-Carlo Planning
      Suppose the MDP has a single state and k actions
        Figure out which action has the best expected reward
        Can sample rewards of actions using calls to the simulator
        Sampling action a is like pulling a slot machine arm with random payoff function R(s, a)
      [Figure: state s with arms a_1, a_2, ..., a_k yielding rewards R(s, a_1), R(s, a_2), ..., R(s, a_k).]
      This is the Multi-Armed Bandit Problem
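To make the simulation-based representation concrete, below is a minimal sketch, not taken from the slides, of the R/T interface as ordinary code. The BanditSimulator class and its Bernoulli payoffs are invented for illustration; it also shows why a single-state MDP is exactly a multi-armed bandit: T always returns the same state, so only the stochastic rewards matter.

```python
# Minimal sketch of a simulation-based MDP: T and R are arbitrary stochastic
# functions rather than explicit probability tables. This particular MDP has
# a single state and k actions (arms), i.e. it is a multi-armed bandit.
import random

class BanditSimulator:
    """Single-state MDP with k actions and noisy, bounded rewards."""
    def __init__(self, means):
        self.means = means          # hidden expected payoff of each arm
        self.state = "s"            # the only state

    def actions(self, s):
        return list(range(len(self.means)))

    def R(self, s, a):
        """Stochastic reward: Bernoulli payoff with arm a's hidden mean."""
        return 1.0 if random.random() < self.means[a] else 0.0

    def T(self, s, a):
        """Stochastic transition: a single-state MDP always stays put."""
        return s

sim = BanditSimulator(means=[0.2, 0.5, 0.8])
s = sim.state
print([sim.R(s, a) for a in sim.actions(s)])   # one sampled reward per arm
```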

  5. PAC Bandit Objective
      Probably Approximately Correct (PAC)
        Select an arm that probably (with high probability, 1 − δ) has approximately (i.e., within ε) the best expected reward
        Use as few simulator calls (or pulls) as possible
      [Figure: arms a_1, a_2, ..., a_k with reward functions R(s, a_1), R(s, a_2), ..., R(s, a_k) — the Multi-Armed Bandit Problem.]

     UniformBandit Algorithm (NaiveBandit from [Even-Dar et al., 2002])
      1. Pull each arm w times (uniform pulling).
      2. Return the arm with the best average reward.
      [Figure: sampled rewards r_11, ..., r_1w; r_21, ..., r_2w; ...; r_k1, ..., r_kw for arms a_1 through a_k.]
      How large must w be to provide a PAC guarantee? (A runnable sketch of the algorithm follows this page.)

     Aside: Additive Chernoff Bound
      Let R be a random variable with maximum absolute value Z, and let r_i (for i = 1, ..., w) be i.i.d. samples of R.
      The Chernoff bound bounds the probability that the average of the r_i is far from E[R]:
          $\Pr\!\left( \left| E[R] - \tfrac{1}{w}\sum_{i=1}^{w} r_i \right| \ge \epsilon \right) \le \exp\!\left( -w \left( \tfrac{\epsilon}{Z} \right)^{2} \right)$
      Equivalently: with probability at least 1 − δ we have
          $\left| E[R] - \tfrac{1}{w}\sum_{i=1}^{w} r_i \right| \le Z \sqrt{\tfrac{1}{w} \ln \tfrac{1}{\delta}}$

     UniformBandit PAC Bound
      With a bit of algebra and the Chernoff bound we get: if
          $w \ge \left( \tfrac{R_{\max}}{\epsilon} \right)^{2} \ln \tfrac{k}{\delta}$
      then for all arms simultaneously
          $\left| E[R(s, a_i)] - \tfrac{1}{w}\sum_{j=1}^{w} r_{ij} \right| \le \epsilon$
      with probability at least 1 − δ.
      That is, the estimates of all actions are ε-accurate with probability at least 1 − δ.
      Thus selecting the estimate with the highest value is approximately optimal with high probability, i.e., PAC.

     # Simulator Calls for UniformBandit
      Total simulator calls for PAC:  $k \cdot w = O\!\left( \tfrac{k}{\epsilon^{2}} \ln \tfrac{k}{\delta} \right)$
      The ln(k) term can be removed with a more complex algorithm [Even-Dar et al., 2002].
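Below is a runnable sketch of UniformBandit that plugs in the PAC-derived sample size w = (R_max/ε)² ln(k/δ) from the bound above, so the returned arm is approximately optimal with high probability, as on the slide. The pull(a) interface and the Bernoulli arms are assumptions made for illustration; they are not part of the slides.

```python
# Minimal sketch of UniformBandit (NaiveBandit): pull every arm w times and
# return the arm with the best average reward, with w chosen so that every
# arm's average is within epsilon of its true mean with probability at
# least 1 - delta (the PAC guarantee on the slide).
import math
import random

def uniform_bandit(pull, k, eps, delta, r_max=1.0):
    """pull(a) stochastically returns one reward sample (in [0, r_max]) for arm a."""
    w = math.ceil((r_max / eps) ** 2 * math.log(k / delta))
    averages = [sum(pull(a) for _ in range(w)) / w for a in range(k)]
    best = max(range(k), key=lambda a: averages[a])
    return best, w

# Made-up Bernoulli arms for illustration.
means = [0.2, 0.5, 0.8]
pull = lambda a: 1.0 if random.random() < means[a] else 0.0
best, w = uniform_bandit(pull, k=len(means), eps=0.1, delta=0.05)
print(f"picked arm {best} using {w} pulls per arm ({w * len(means)} total)")
```

The total number of simulator calls is k · w, matching the O((k/ε²) ln(k/δ)) count on the last slide above.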
