Discretization
Pieter Abbeel, UC Berkeley EECS

Markov Decision Process
- Assumption: the agent gets to observe the state.
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]
Markov Decision Process (S, A, T, R, H)
Given:
- S: set of states
- A: set of actions
- T: S x A x S x {0, 1, ..., H} → [0, 1], with T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
- R: S x A x S x {0, 1, ..., H} → ℝ, with R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
- H: horizon over which the agent will act
Goal: find a policy π : S x {0, 1, ..., H} → A that maximizes the expected sum of rewards, i.e.,
    max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]

Value Iteration
- Idea: V*_i(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.
- Algorithm:
  - Start with V*_0(s) = 0 for all s.
  - For i = 1, ..., H, for all states s ∈ S:
        V*_i(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]
  - Action selection:
        π*_i(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]
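A minimal NumPy sketch of the finite-horizon value iteration back-up above, assuming time-invariant T and R given as arrays; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def value_iteration(T, R, H):
    """Finite-horizon value iteration for a discrete MDP.

    T: array of shape (S, A, S), T[s, a, s2] = P(s_{t+1} = s2 | s_t = s, a_t = a)
    R: array of shape (S, A, S), R[s, a, s2] = reward for that transition
    H: horizon (number of steps)
    Returns the value function V (shape (S,)) and the greedy policies.
    """
    num_states, num_actions, _ = T.shape
    V = np.zeros(num_states)                 # V*_0(s) = 0 for all s
    policies = []
    for i in range(1, H + 1):
        # Q[s, a] = sum_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]
        Q = np.einsum('xay,xay->xa', T, R + V[None, None, :])
        policies.append(Q.argmax(axis=1))    # greedy policy when i steps remain
        V = Q.max(axis=1)                    # Bellman back-up
    return V, policies
```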
Continuous State Spaces
- S = continuous set
- Value iteration becomes impractical, as it requires computing, for all states s ∈ S:
      V*_i(s) = max_a ∫_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ] ds'

Markov chain approximation to the continuous state-space dynamics model ("discretization")
- Original MDP: (S, A, T, R, H)
- Discretized MDP:
  - Grid the state space: the vertices are the discrete states (a small gridding sketch follows this slide).
  - Reduce the action space to a finite set.
    - Sometimes not needed:
      - When the Bellman back-up can be computed exactly over the continuous action space
      - When we know only certain controls are part of the optimal policy (e.g., when we know the problem has a "bang-bang" optimal solution)
  - Transition function: see the next few slides.
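A small sketch of the gridding step, assuming the continuous state space is (or has been restricted to) a box; `make_grid` and its arguments are illustrative names, not from the slides.

```python
import itertools
import numpy as np

def make_grid(lower, upper, h):
    """Grid the box [lower, upper] in R^d with spacing h; the vertices become
    the discrete states of the approximating MDP.

    lower, upper: arrays of shape (d,) with the box bounds (assumed).
    h: grid spacing.
    Returns an array of shape (num_vertices, d) with vertex coordinates.
    """
    axes = [np.arange(lo, hi + 1e-9, h) for lo, hi in zip(lower, upper)]
    return np.array(list(itertools.product(*axes)))

# Example: 2D state (q, qdot) on [-1, 1] x [-1, 1] with spacing h = 0.1
grid = make_grid(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), 0.1)
```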
Discretization Approach A: Deterministic Transition onto Nearest Vertex --- 0'th Order Approximation
- Discrete states: {ξ_1, ..., ξ_6}
[Figure: taking action a from ξ_1, the next-state distribution is snapped onto the nearest vertices; example transition probabilities 0.1, 0.3, 0.4, 0.2.]
- Similarly define transition probabilities for all ξ_i → this gives a discrete MDP just over the states {ξ_1, ..., ξ_6}, which we can solve with value iteration (see the construction sketch after this slide).
- If a (state, action) pair can result in infinitely many (or very many) different next states: sample next states from the next-state distribution.

Discretization Approach B: Stochastic Transition onto Neighboring Vertices --- 1'st Order Approximation
- Discrete states: {ξ_1, ..., ξ_12}
[Figure: taking action a from ξ_1 lands at s', which lies inside a triangle of neighboring vertices; the transition probability is split over those vertices with weights p_A, p_B, p_C.]
- If stochastic: repeat the procedure to account for all possible transitions and weight accordingly.
- Need not be triangular; other ways to select the neighbors that contribute could be used. "Kuhn triangulation" is a particular choice that allows for efficient computation of the weights p_A, p_B, p_C, also in higher dimensions.
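A sketch of how the Approach A transition model could be built, assuming a user-supplied `step(s, a)` sampler for the continuous dynamics (all names here are illustrative); the 1'st order interpolation weights are sketched after the Kuhn triangulation slide below.

```python
import numpy as np

def nearest_vertex_transitions(vertices, actions, step, num_samples=1):
    """Approach A: build the discrete transition model by snapping the next
    state onto the nearest vertex.

    vertices: (N, d) array of discrete states xi_1..xi_N.
    actions:  list/array of discrete actions.
    step(s, a): assumed user-supplied function returning one sampled next
                state of the continuous dynamics.
    num_samples: > 1 if a (state, action) pair can lead to many next states,
                 in which case we sample from the next-state distribution.
    Returns T of shape (N, M, N) with T[i, j, k] = P(xi_k | xi_i, a_j).
    """
    N, M = len(vertices), len(actions)
    T = np.zeros((N, M, N))
    for i, s in enumerate(vertices):
        for j, a in enumerate(actions):
            for _ in range(num_samples):
                s_next = step(s, a)
                # 0'th order: put all probability mass on the nearest vertex
                k = np.argmin(np.linalg.norm(vertices - s_next, axis=1))
                T[i, j, k] += 1.0 / num_samples
    return T
```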
Discretization: Our Status
- We have seen two ways to turn a continuous state-space MDP into a discrete state-space MDP.
- When we solve the discrete state-space MDP, we find:
  - A policy and value function for the discrete states
  - They are optimal for the discrete MDP, but typically not for the original MDP
- Remaining questions:
  - How to act when in a state that is not in the discrete state set?
  - How close to optimal are the obtained policy and value function?

How to Act (i): 0-step Lookahead
- For a non-discrete state s, choose the action based on the policy in nearby discrete states:
  - Nearest neighbor: use π̄(ξ_i) for the discrete state ξ_i closest to s.
  - (Stochastic) interpolation: use π̄(ξ_i) with probability p_i(s), where the p_i(s) are the interpolation weights of s over its neighboring vertices.
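A sketch of 0-step lookahead action selection, assuming a discrete policy over the vertices and an (assumed) helper `weights_fn(s)` that returns the interpolating vertices and weights; names are illustrative.

```python
import numpy as np

def act_0_step(s, vertices, discrete_policy, weights_fn=None):
    """0-step lookahead: act in a non-discrete state s using the discrete policy.

    discrete_policy[i] is the action the discrete MDP assigns to vertex xi_i.
    weights_fn(s): assumed helper returning (indices, probabilities) of the
    vertices that interpolate s (e.g., Kuhn triangulation weights).
    """
    if weights_fn is None:
        # Nearest neighbor: copy the action of the closest discrete state
        i = np.argmin(np.linalg.norm(vertices - s, axis=1))
        return discrete_policy[i]
    # Stochastic interpolation: pick a neighboring vertex with probability
    # equal to its interpolation weight, then use that vertex's action
    idx, p = weights_fn(s)
    return discrete_policy[np.random.choice(idx, p=p)]
```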
How to Act (ii): 1-step Lookahead
- Use the value function V̄ found for the discrete MDP:
      π(s) = argmax_a E_{s'} [ R(s, a, s') + V̂(s') ]
  - Nearest neighbor: V̂(s') = V̄(ξ_i) for the discrete state ξ_i closest to s'.
  - (Stochastic) interpolation: V̂(s') = Σ_i p_i(s') V̄(ξ_i), with interpolation weights p_i(s').

How to Act (iii): n-step Lookahead
- Think about how you could do this for n-step lookahead.
- Why might large n not be practical in most cases?
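A sketch of 1-step lookahead, assuming user-supplied `step(s, a)` and `reward(s, a, s_next)` models of the continuous MDP (deterministic here for brevity; average over samples if stochastic). All names are illustrative.

```python
import numpy as np

def act_1_step(s, actions, step, reward, vertices, V_discrete, weights_fn=None):
    """1-step lookahead using the value function of the discrete MDP.

    V_discrete[i] is the discrete value at vertex xi_i; weights_fn(s) is the
    assumed interpolation helper (None => nearest neighbor).
    """
    def V_hat(s_next):
        if weights_fn is None:
            # Nearest-neighbor evaluation of the discrete value function
            return V_discrete[np.argmin(np.linalg.norm(vertices - s_next, axis=1))]
        # Interpolated evaluation: convex combination of neighboring vertex values
        idx, p = weights_fn(s_next)
        return float(np.dot(p, V_discrete[idx]))

    best_a, best_q = None, -np.inf
    for a in actions:
        s_next = step(s, a)                       # deterministic sketch
        q = reward(s, a, s_next) + V_hat(s_next)
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```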
Example: Double Integrator --- Quadratic Cost
- Dynamics: q̈ = u (state (q, q̇); the control directly sets the acceleration)
- Cost function: g(q, q̇, u) = q^2 + u^2

0'th Order Interpolation, 1-Step Lookahead for Action Selection --- Trajectories
[Figure: trajectories (dt = 0.1) for the optimal controller and for nearest neighbor with grid resolutions h = 1, h = 0.1, and h = 0.02.]
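A small sketch of the double-integrator model used in this example, with an Euler-discretized step and the quadratic cost from the slide; the commented rollout reuses the hypothetical `act_1_step`, `vertices`, and `V_discrete` from the earlier sketches.

```python
import numpy as np

# Double integrator: state x = (q, qdot), control u, Euler step of size dt.
def double_integrator_step(x, u, dt=0.1):
    q, qdot = x
    return np.array([q + dt * qdot, qdot + dt * u])   # qddot = u

def cost(x, u):
    q, _ = x
    return q**2 + u**2                                # g(q, qdot, u) = q^2 + u^2

# Hypothetical rollout (names from the earlier sketches, not from the slides):
# x = np.array([1.0, 0.0])
# for t in range(100):
#     u = act_1_step(x, actions,
#                    lambda s, a: double_integrator_step(s, a),
#                    lambda s, a, s2: -cost(s, a),    # reward = negative cost
#                    vertices, V_discrete)
#     x = double_integrator_step(x, u)
```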
0'th Order Interpolation, 1-Step Lookahead for Action Selection --- Resulting Cost
[Figure: resulting cost for nearest-neighbor discretization at the different grid resolutions.]

1'st Order Interpolation, 1-Step Lookahead for Action Selection --- Trajectories
[Figure: trajectories for the optimal controller and for Kuhn triangulation with h = 1, h = 0.1, and h = 0.02.]
1'st Order Interpolation, 1-Step Lookahead for Action Selection --- Resulting Cost
[Figure: resulting cost for Kuhn triangulation at the different grid resolutions.]

Discretization Quality Guarantees
- Typical guarantees:
  - Assume: smoothness of the cost function and of the transition model.
  - Then, for h → 0, the discretized value function approaches the true value function.
- To obtain a guarantee about the resulting policy, combine the above with a general result about MDPs:
  - A one-step lookahead policy based on a value function V which is close to V* attains value close to V* (one standard form of this bound is sketched below).
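One standard way to make the last point precise, for a discounted infinite-horizon MDP with discount factor γ; this specific form is an assumed illustration (the bound of Singh and Yee), not the statement from the slide.

```latex
% Assumed illustration: greedy (one-step lookahead) policy w.r.t. an
% approximate value function V that is epsilon-close to V* in sup-norm.
\[
\pi_V(s) \in \arg\max_a \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma V(s')\big],
\qquad
\|V - V^*\|_\infty \le \varepsilon
\;\Longrightarrow\;
\|V^{\pi_V} - V^*\|_\infty \le \frac{2\gamma\varepsilon}{1-\gamma}.
\]
```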
Quality of Value Function Obtained from Discrete MDP: Proof Techniques
- Chow and Tsitsiklis, 1991:
  - Show that one discretized back-up is close to one "complete" back-up, then show that the sequence of back-ups is also close.
- Kushner and Dupuis, 2001:
  - Show that sample paths in the discrete stochastic MDP approach sample paths in the continuous (deterministic) MDP. [There are also proofs for the stochastic continuous case, which are a bit more complex.]
- Function-approximation-based proof (see the later slides for what is meant by "function approximation"):
  - Great descriptions: Gordon, 1995; Tsitsiklis and Van Roy, 1996

Example result (Chow and Tsitsiklis, 1991)
[Formal statement of the approximation bound shown on the slide.]
Value Iteration with Function Approximation
Provides an alternative derivation and interpretation of the discretization methods we have covered in this set of slides (see the sketch after this slide):
- Start with V̄*_0(ξ) = 0 for all ξ.
- For i = 1, ..., H, for all discrete states ξ ∈ S̄, where S̄ is the discrete state set:
      V̄*_i(ξ) = max_a Σ_{s'} T(ξ, a, s') [ R(ξ, a, s') + V̂_{i-1}(s') ],
  where V̂_{i-1} is the function approximation of V̄*_{i-1} over the continuous state space:
  - 0'th order function approximation: V̂(s') = V̄(ξ_j) for the vertex ξ_j nearest to s'.
  - 1'st order function approximation: V̂(s') = Σ_j p_j(s') V̄(ξ_j), with interpolation weights p_j(s').

Discretization as function approximation
- 0'th order function approximation builds a piecewise constant approximation of the value function.
- 1'st order function approximation builds a piecewise linear (over "triangles") approximation of the value function.
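A sketch of the fitted back-up above, assuming deterministic `step(s, a)` and `reward(s, a, s_next)` models (average over samples if stochastic) and the assumed `weights_fn` interpolation helper; all names are illustrative.

```python
import numpy as np

def fitted_value_iteration(vertices, actions, step, reward, H, weights_fn=None):
    """Value iteration with function approximation over the discrete states.

    weights_fn(s) returns (indices, weights) for 1'st order interpolation;
    if None, 0'th order (nearest neighbor) approximation is used.
    """
    N = len(vertices)
    V = np.zeros(N)                                  # V-bar*_0 = 0
    for i in range(H):
        V_new = np.empty(N)
        for j, xi in enumerate(vertices):
            q_values = []
            for a in actions:
                s_next = step(xi, a)
                if weights_fn is None:               # piecewise constant V-hat
                    v_next = V[np.argmin(np.linalg.norm(vertices - s_next, axis=1))]
                else:                                # piecewise linear V-hat
                    idx, p = weights_fn(s_next)
                    v_next = float(np.dot(p, V[idx]))
                q_values.append(reward(xi, a, s_next) + v_next)
            V_new[j] = max(q_values)                 # Bellman back-up at vertex xi
        V = V_new
    return V
```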
Kuhn triangulation
- Allows efficient computation of the vertices participating in a point's barycentric coordinate system, and of the convex interpolation weights (a.k.a. the barycentric coordinates).
- See Munos and Moore, 2001 for further details.

Kuhn triangulation (from Munos and Moore)
[Figure: illustration of the Kuhn triangulation of a grid cell, from Munos and Moore.]
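A sketch of the weight computation on a uniform grid with spacing h, following the standard construction (sort the point's local coordinates, take successive differences); boundary handling is omitted and all names are illustrative.

```python
import numpy as np

def kuhn_weights(s, lower, h):
    """Kuhn-triangulation interpolation for a point s on a regular grid.

    lower: coordinates of the grid origin; h: grid spacing (assumed uniform).
    Returns (vertex_indices, weights): the d+1 grid vertices (as integer
    multi-indices) whose simplex contains s, and the convex (barycentric)
    weights, which are nonnegative and sum to 1.
    """
    d = len(s)
    base = np.floor((s - lower) / h).astype(int)     # cell containing s
    y = (s - lower) / h - base                       # local coords in [0, 1]^d
    order = np.argsort(-y)                           # sort coords descending
    y_sorted = y[order]

    vertex_indices = [base.copy()]
    corner = base
    for k in range(d):                               # walk along the simplex edges
        corner = corner.copy()
        corner[order[k]] += 1
        vertex_indices.append(corner)

    weights = np.empty(d + 1)
    weights[0] = 1.0 - y_sorted[0]
    weights[1:d] = y_sorted[:d-1] - y_sorted[1:d]
    weights[d] = y_sorted[d-1]
    return np.array(vertex_indices), weights
```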
Continuous time
- One might want to discretize time in a variable way, such that one discrete-time transition roughly corresponds to a transition into neighboring grid points/regions.
- Discounting: the time step δt (and hence the per-transition discount) depends on the state and action.
- See, e.g., Munos and Moore, 2001 for details.
- Note: numerical methods research refers to this connection between time and space as the CFL (Courant-Friedrichs-Lewy) condition. Googling for this term will give you more background info.
- !! 1-nearest-neighbor tends to be especially sensitive to having the correct match. [Indeed, with a mismatch between time and space, 1-nearest-neighbor might end up mapping many states to only transition to themselves, no matter which action is taken.]

Nearest neighbor quickly degrades when the time and space scales are mismatched
[Figure: trajectories for dt = 0.01 and dt = 0.1 at grid resolutions h = 0.1 and h = 0.02.]
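A toy check of the self-transition pathology mentioned above, assuming the grid, action set, and `double_integrator_step` from the earlier sketches: when dt is much smaller than the grid spacing, every action's next state snaps back onto the current vertex.

```python
import numpy as np

def self_transitions(vertices, actions, dt, step):
    """Count discrete states whose nearest-neighbor transitions are all
    self-transitions (a sign of time/space scale mismatch).

    step(s, a, dt): assumed continuous dynamics step, e.g. double_integrator_step.
    """
    count = 0
    for i, xi in enumerate(vertices):
        next_idx = {int(np.argmin(np.linalg.norm(vertices - step(xi, a, dt), axis=1)))
                    for a in actions}
        if next_idx == {i}:          # every action snaps back onto xi itself
            count += 1
    return count
```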