  1. Markov Decision Processes and Dynamic Programming. A. LAZARIC (SequeL Team @ INRIA-Lille), ENS Cachan, Master 2 MVA, MVA-RL Course.

  2. In This Lecture
  ◮ How do we formalize the agent-environment interaction? ⇒ Markov Decision Process (MDP)
  ◮ How do we solve an MDP? ⇒ Dynamic Programming

  3. Outline
  ◮ Mathematical Tools
  ◮ The Markov Decision Process
  ◮ Bellman Equations for Discounted Infinite Horizon Problems
  ◮ Bellman Equations for Undiscounted Infinite Horizon Problems
  ◮ Dynamic Programming
  ◮ Conclusions

  4. Mathematical Tools – Probability Theory
Definition (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is
  P(A | B) = P(A ∩ B) / P(B).
Similarly, if X and Y are non-degenerate, jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is
  P(X ∈ A | Y ∈ B) = ∫_{x∈A} ∫_{y∈B} f_{X,Y}(x, y) dy dx / ∫_x ∫_{y∈B} f_{X,Y}(x, y) dy dx.
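As a quick numerical check of the discrete definition, the following sketch (not part of the original slides; the die events are invented for illustration) estimates P(A | B) by Monte Carlo and compares it with P(A ∩ B)/P(B):

```python
import random

# Events on a fair six-sided die: A = "the roll is even", B = "the roll is >= 3".
in_A = lambda x: x % 2 == 0
in_B = lambda x: x >= 3

n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

p_B = sum(in_B(x) for x in rolls) / n
p_A_and_B = sum(in_A(x) and in_B(x) for x in rolls) / n

# Empirical P(A | B): frequency of A among the samples where B holds.
B_samples = [x for x in rolls if in_B(x)]
p_A_given_B = sum(in_A(x) for x in B_samples) / len(B_samples)

print(p_A_given_B, p_A_and_B / p_B)  # both approximate 2/4 = 0.5
```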

  5. Mathematical Tools – Probability Theory
Definition (Law of total expectation). Given a function f and two random variables X, Y we have that
  E_{X,Y}[f(X, Y)] = E_X[ E_Y[f(x, Y) | X = x] ].
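The tower property can likewise be verified by simulation; in this minimal sketch the joint distribution of (X, Y) and the function f are invented for illustration:

```python
import random

# Illustrative model: X ~ Uniform{0, 1}, Y | X = x ~ Uniform{0, ..., x + 1}, f(x, y) = x + y.
def sample_pair():
    x = random.randint(0, 1)
    return x, random.randint(0, x + 1)

n = 200_000
# Left-hand side: E[f(X, Y)] estimated by direct sampling.
lhs = sum(x + y for x, y in (sample_pair() for _ in range(n))) / n

# Right-hand side: E_X[E_Y[f(x, Y) | X = x]], with the inner expectations in closed form:
# E[f(0, Y) | X = 0] = 0 + 0.5 and E[f(1, Y) | X = 1] = 1 + 1.0.
rhs = 0.5 * (0 + 0.5) + 0.5 * (1 + 1.0)

print(lhs, rhs)  # both approximate 1.25
```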

  6. Mathematical Tools – Norms and Contractions
Definition. Given a vector space V ⊆ R^d, a function f : V → R⁺₀ is a norm if and only if
  ◮ If f(v) = 0 for some v ∈ V, then v = 0.
  ◮ For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
  ◮ Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).

  7. Mathematical Tools – Norms and Contractions
  ◮ L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.
  ◮ L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.
  ◮ L_{μ,p}-norm: ||v||_{μ,p} = ( Σ_{i=1}^d |v_i|^p / μ_i )^{1/p}.
  ◮ L_{μ,∞}-norm: ||v||_{μ,∞} = max_{1≤i≤d} |v_i| / μ_i.
  ◮ L_{2,P}-matrix norm (P is a positive definite matrix): ||v||²_P = v⊤ P v.
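These norms are all one-liners in code; the following sketch (the vector, the weights μ, and the matrix P are arbitrary illustrative values) implements each definition with NumPy:

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])
mu = np.array([0.5, 1.0, 2.0])   # illustrative positive weights
P = np.diag([2.0, 1.0, 3.0])     # illustrative positive definite matrix

lp = lambda v, p: np.sum(np.abs(v) ** p) ** (1 / p)             # L_p
linf = lambda v: np.max(np.abs(v))                              # L_inf
lp_mu = lambda v, mu, p: np.sum(np.abs(v) ** p / mu) ** (1 / p) # L_{mu,p}
linf_mu = lambda v, mu: np.max(np.abs(v) / mu)                  # L_{mu,inf}
sq_P = lambda v, P: v @ P @ v                                   # ||v||_P^2 = v^T P v

print(lp(v, 2), np.linalg.norm(v))  # the two L_2 computations agree
```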

  8. Mathematical Tools – Norms and Contractions
Definition. A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm ||·|| to v ∈ V if
  lim_{n→∞} ||v_n − v|| = 0.
Definition. A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if
  lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.
Definition. A vector space V equipped with a norm ||·|| is complete if every Cauchy sequence in V is convergent in the norm of the space.

  9. Mathematical Tools – Norms and Contractions
Definition. An operator T : V → V is L-Lipschitz if for any v, u ∈ V
  ||T v − T u|| ≤ L ||u − v||.
If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n → v in ||·|| then T v_n → T v in ||·||.
Definition. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

  10. Mathematical Tools – Norms and Contractions
Proposition (Banach Fixed Point Theorem). Let V be a complete vector space equipped with the norm ||·|| and T : V → V be a γ-contraction mapping. Then
  1. T admits a unique fixed point v.
  2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v in ||·|| with a geometric convergence rate:
     ||v_n − v|| ≤ γ^n ||v_0 − v||.
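A one-dimensional sketch makes the proposition concrete: T(v) = γv + b is a γ-contraction in the absolute value (γ = 0.9 and b = 1 are illustrative choices), its unique fixed point is v = b/(1 − γ), and the iterates satisfy the geometric bound:

```python
gamma, b = 0.9, 1.0           # illustrative parameters
T = lambda v: gamma * v + b   # |T(v) - T(u)| = gamma |v - u|, so T is a gamma-contraction
v_star = b / (1 - gamma)      # the unique fixed point: T(v_star) = v_star

v0 = 0.0                      # arbitrary starting point
v = v0
for n in range(1, 31):
    v = T(v)
    # Banach bound: |v_n - v_star| <= gamma^n |v_0 - v_star| (here it holds with equality).
    assert abs(v - v_star) <= gamma ** n * abs(v0 - v_star) + 1e-9

print(v, v_star)  # v_30 is already close to 10.0
```

The same mechanism reappears later in the lecture: the Bellman operator of a discounted MDP is a γ-contraction, so iterating it converges geometrically to its unique fixed point.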

  11. Mathematical Tools – Linear Algebra
Given a square matrix A ∈ R^{N×N}:
  ◮ Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv.
  ◮ Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {μ_i} with μ_i = 1 − αλ_i.
  ◮ Matrix inversion. A can be inverted if and only if ∀i, λ_i ≠ 0.
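The second property is easy to check numerically; this sketch (a random symmetric matrix and an arbitrary α, both invented for illustration) verifies that I − αA has eigenvalues 1 − αλ_i:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2          # symmetric, so the eigenvalues are real
alpha = 0.3                # illustrative scalar

lam = np.linalg.eigvalsh(A)
mu = np.linalg.eigvalsh(np.eye(4) - alpha * A)

# mu_i = 1 - alpha * lambda_i, up to the ordering of the two spectra.
print(np.allclose(np.sort(mu), np.sort(1 - alpha * lam)))  # True
```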

  12. Mathematical Tools – Linear Algebra
  ◮ Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if
    1. all entries are non-negative, ∀i, j, [P]_{i,j} ≥ 0,
    2. all the rows sum to one, ∀i, Σ_{j=1}^N [P]_{i,j} = 1.
All the eigenvalues of a stochastic matrix are bounded by 1 in modulus, i.e., ∀i, |λ_i| ≤ 1.
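Both conditions and the eigenvalue bound can be checked in a few lines; the 5×5 matrix below is randomly generated just for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Random 5x5 stochastic matrix: non-negative entries, rows normalized to sum to 1.
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)

eig = np.linalg.eigvals(P)
print(np.max(np.abs(eig)) <= 1 + 1e-12)    # True: all eigenvalues have modulus <= 1
print(np.isclose(np.max(np.abs(eig)), 1))  # True: 1 is always an eigenvalue (P @ ones = ones)
```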

  13. Outline
  ◮ Mathematical Tools
  ◮ The Markov Decision Process
  ◮ Bellman Equations for Discounted Infinite Horizon Problems
  ◮ Bellman Equations for Undiscounted Infinite Horizon Problems
  ◮ Dynamic Programming
  ◮ Conclusions

  14. The Markov Decision Process – The Reinforcement Learning Model
[Figure: the agent-environment loop. The learning agent acts on the environment (actuation); the environment returns the state (perception) and a critic returns the reward.]

  15. The Markov Decision Process – Markov Chains
Definition (Markov chain). Let the state space X be a compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if it satisfies the Markov property
  P(x_{t+1} = x | x_t, x_{t−1}, . . . , x_0) = P(x_{t+1} = x | x_t).
Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p:
  p(y | x) = P(x_{t+1} = y | x_t = x).
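For a finite state space the definition boils down to sampling the next state from the row p(· | x) of a transition matrix; the two-state chain below is an invented example:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],    # [P]_{x, y} = p(y | x); an illustrative 2-state chain
              [0.5, 0.5]])

def simulate(P, x0, T):
    """Sample x_0, ..., x_T; the next state depends only on the current one (Markov property)."""
    traj = [x0]
    for _ in range(T):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

print(simulate(P, x0=0, T=20))
```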

  16. The Markov Decision Process – Markov Decision Process
Definition (Markov decision process [1, 4, 3, 5, 2]). A Markov decision process is defined as a tuple M = (X, A, p, r) where
  ◮ X is the state space,
  ◮ A is the action space,
  ◮ p(y | x, a) is the transition probability with p(y | x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
  ◮ r(x, a, y) is the reward of transition (x, a, y).
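For finite X and A the tuple M = (X, A, p, r) is just two arrays; the container below and the toy numbers in it are an illustrative convention, not a standard API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    n_states: int
    n_actions: int
    p: np.ndarray  # p[x, a, y] = P(x_{t+1} = y | x_t = x, a_t = a)
    r: np.ndarray  # r[x, a, y] = reward of transition (x, a, y)

# A toy 2-state, 2-action MDP (numbers invented for illustration).
p = np.zeros((2, 2, 2))
p[0, 0] = [1.0, 0.0]; p[0, 1] = [0.2, 0.8]
p[1, 0] = [0.7, 0.3]; p[1, 1] = [0.0, 1.0]
r = np.zeros((2, 2, 2))
r[:, :, 1] = 1.0  # reward 1 for any transition that lands in state 1

mdp = MDP(n_states=2, n_actions=2, p=p, r=r)
assert np.allclose(mdp.p.sum(axis=2), 1.0)  # every p(.|x, a) is a distribution
```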

  17. The Markov Decision Process – Policy
Definition (Policy). A decision rule π_t can be
  ◮ Deterministic: π_t : X → A,
  ◮ Stochastic: π_t : X → Δ(A).
A policy (strategy, plan) can be
  ◮ Non-stationary: π = (π_0, π_1, π_2, . . .),
  ◮ Stationary (Markovian): π = (π, π, π, . . .).
Remark: MDP M + stationary policy π ⇒ Markov chain on the state space X with transition probability p(y | x) = p(y | x, π(x)).
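The remark can be checked mechanically on the toy MDP sketched above: fixing a deterministic stationary policy π collapses p(y | x, a) into an ordinary transition matrix (the particular π is an arbitrary choice):

```python
import numpy as np

# p[x, a, y] from the toy MDP sketched above.
p = np.zeros((2, 2, 2))
p[0, 0] = [1.0, 0.0]; p[0, 1] = [0.2, 0.8]
p[1, 0] = [0.7, 0.3]; p[1, 1] = [0.0, 1.0]

pi = np.array([1, 0])  # deterministic stationary policy: pi(x=0) = 1, pi(x=1) = 0

# Induced Markov chain: p_pi(y | x) = p(y | x, pi(x)).
P_pi = np.stack([p[x, pi[x]] for x in range(len(pi))])
print(P_pi)              # [[0.2, 0.8], [0.7, 0.3]]: a stochastic matrix, rows sum to 1
```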

  18. The Markov Decision Process – Question
Is the MDP formalism powerful enough? ⇒ Let's try!

  19. The Markov Decision Process – Example: the Retail Store Management Problem
Description. At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that
  ◮ The cost of maintaining an inventory of x items is h(x).
  ◮ The cost to order a items is C(a).
  ◮ The income for selling q items is f(q).
  ◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
  ◮ The value of the remaining inventory at the end of the year is g(x).
  ◮ Constraint: the store has a maximum capacity M.

  20. The Markov Decision Process – Example: the Retail Store Management Problem
  ◮ State space: x ∈ X = {0, 1, . . . , M}.
  ◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
  ◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.
    Problem: the dynamics should be Markov and stationary!
  ◮ The demand D_t is stochastic and time-independent. Formally, D_t ~ D, i.i.d.
  ◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
A small simulator of these dynamics is sketched below.
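To make the formalization concrete, here is a minimal simulator of one month of these dynamics. The capacity M, the demand distribution, and the cost/income functions C, h, f are invented placeholders; only the transition and reward formulas come from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 10  # illustrative store capacity

# Placeholder cost/income functions (the slide leaves them abstract).
C = lambda a: 2.0 * a   # ordering cost
h = lambda x: 0.5 * x   # inventory maintenance cost
f = lambda q: 4.0 * q   # income for selling q items

def step(x, a):
    """One month: order a in A(x) = {0, ..., M - x}, observe the demand, sell, transition."""
    assert 0 <= a <= M - x
    D = rng.integers(0, M + 1)        # illustrative i.i.d. demand D_t ~ D
    x_next = max(x + a - D, 0)        # x_{t+1} = [x_t + a_t - D_t]^+
    sold = x + a - x_next             # [x_t + a_t - x_{t+1}]^+
    r = -C(a) - h(x + a) + f(sold)    # r_t as on the slide
    return x_next, r

x, total = 5, 0.0
for t in range(12):
    x, r = step(x, a=min(3, M - x))   # naive "order up to 3 items" policy
    total += r
print(total)
```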

  21. The Markov Decision Process – Exercise: the Parking Problem
A driver wants to park his car as close as possible to the restaurant.
[Figure: parking places 1, 2, . . . , T in a row leading to the restaurant; place t is free with probability p(t); the reward grows as t approaches the restaurant and is 0 if the driver never parks.]
  ◮ The driver cannot see whether a place is available unless he is in front of it.
  ◮ There are P places.
  ◮ At each place i the driver can either move to the next place or park (if the place is available).
  ◮ The closer the parking place is to the restaurant, the higher the satisfaction.
  ◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.
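One way to start the exercise is to simulate it. The sketch below encodes an episode as a drive past places 1, . . . , P and evaluates the naive rule "park at the first free place at or after a threshold"; the availability probabilities, the reward shape, and the threshold rule are all illustrative assumptions, not the intended solution:

```python
import random

P_PLACES = 20                      # "There are P places" (illustrative value)
p_free = lambda i: 0.4             # illustrative: each place free with prob. 0.4, independently
reward = lambda i: i / P_PLACES    # illustrative: closer to the restaurant = higher reward

def episode(threshold):
    """Drive past places 1..P; park at the first free place with index >= threshold."""
    for i in range(1, P_PLACES + 1):
        free = random.random() < p_free(i)   # availability is only observed in front of place i
        if free and i >= threshold:
            return reward(i)                 # park here
    return 0.0                               # never parked: leave (reward 0)

n = 100_000
for threshold in (1, 10, 15, 18):
    avg = sum(episode(threshold) for _ in range(n)) / n
    print(threshold, round(avg, 3))          # too greedy or too patient both lose reward
```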
