The Markov Decision Process

Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2])
A Markov decision process is defined as a tuple M = (X, A, p, r) where
◮ t is the time clock,
◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability, with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of the transition (x, a, y).
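To make the tuple concrete, here is a minimal sketch of how a finite MDP could be stored as NumPy arrays. The class name FiniteMDP and its fields are illustrative choices, not part of the lecture; later sketches in this section reuse them.

```python
import numpy as np

class FiniteMDP:
    """Minimal container for a finite MDP M = (X, A, p, r).

    P[a, x, y] = p(y | x, a)   (transition probabilities)
    R[x, a]    = expected reward for taking action a in state x
    """
    def __init__(self, P, R, gamma=0.95):
        self.P = np.asarray(P)          # shape (A, N, N)
        self.R = np.asarray(R)          # shape (N, A)
        self.gamma = gamma
        self.n_actions, self.n_states, _ = self.P.shape
        # each row p(.|x, a) must be a probability distribution
        assert np.allclose(self.P.sum(axis=-1), 1.0)
```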
Examples
◮ Park a car.
◮ Find the shortest path from home to school.
◮ Schedule a fleet of trucks.
Policy

Definition (Policy)
A decision rule π_t can be
◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → Δ(A).
A policy (strategy, plan) can be
◮ Non-stationary: π = (π_0, π_1, π_2, ...),
◮ Stationary (Markovian): π = (π, π, π, ...).
Remark: an MDP M together with a stationary policy π defines a Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).
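As an illustration of the remark, the sketch below (reusing the hypothetical FiniteMDP container above) shows how a deterministic stationary policy, stored as one action per state, induces the Markov chain transition matrix p(y|x, π(x)).

```python
import numpy as np

def induced_markov_chain(mdp, policy):
    """Return P_pi with P_pi[x, y] = p(y | x, pi(x)) for a deterministic policy.

    `policy` is an integer array of length N (one action per state).
    """
    N = mdp.n_states
    P_pi = np.array([mdp.P[policy[x], x, :] for x in range(N)])   # shape (N, N)
    r_pi = np.array([mdp.R[x, policy[x]] for x in range(N)])      # expected reward per state
    return P_pi, r_pi
```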
Question
Is the MDP formalism powerful enough? ⇒ Let's try!
Example: the Retail Store Management Problem

Description. At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that
◮ The cost of maintaining an inventory of x items is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.
Example: the Retail Store Management Problem
◮ State space: x ∈ X = {0, 1, ..., M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]_+.
  Problem: the dynamics should be Markov and stationary!
◮ The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]_+).
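A minimal simulation sketch of one month of this model. The concrete forms of h, C, f and the Poisson demand are illustrative assumptions; the slides only name the functions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20  # maximum capacity

# Illustrative cost/income functions (the slides only name h, C, f, not their form).
h = lambda x: 0.5 * x          # holding cost
C = lambda a: 2.0 * a          # ordering cost
f = lambda q: 5.0 * q          # income from selling q items

def step(x, a):
    """One month of the retail store MDP: returns (next state, reward)."""
    assert 0 <= a <= M - x
    D = rng.poisson(5)                      # assumed i.i.d. demand distribution
    x_next = max(x + a - D, 0)              # [x + a - D]_+
    sold = x + a - x_next                   # items actually sold
    reward = -C(a) - h(x + a) + f(sold)
    return x_next, reward

x_next, r = step(x=3, a=10)
```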
Exercise: the Parking Problem

A driver wants to park his car as close as possible to the restaurant.
[Figure: a row of parking places 1, 2, ..., T leading to the restaurant, each available with probability p(t); the reward grows as the place gets closer to the restaurant.]
◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.
Question
How do we evaluate a policy and compare two policies? ⇒ Value function!
Optimization over Time Horizon
◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.
◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.
◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.
◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.
State Value Function
◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x ; π ],

where R is a value function for the final state.
State Value Function
◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.

V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x ; π ],

with discount factor 0 ≤ γ < 1:
◮ small γ = short-term rewards, big γ = long-term rewards
◮ for any γ ∈ [0, 1) the series always converges (for bounded rewards)
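As a quick illustration, the discounted value V^π(x) can be estimated by Monte Carlo rollouts truncated at a horizon where γ^t is negligible. The sketch below is an assumption-laden illustration built on the hypothetical FiniteMDP container above.

```python
import numpy as np

def mc_value_estimate(mdp, policy, x0, n_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(x0) for the discounted criterion."""
    rng = np.random.default_rng(seed)
    returns = np.empty(n_rollouts)
    for i in range(n_rollouts):
        x, g, discount = x0, 0.0, 1.0
        for _ in range(horizon):                         # truncation: gamma^horizon is tiny
            a = policy[x]
            g += discount * mdp.R[x, a]
            x = rng.choice(mdp.n_states, p=mdp.P[a, x])  # sample y ~ p(.|x, a)
            discount *= mdp.gamma
        returns[i] = g
    return returns.mean()
```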
State Value Function
◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.

V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x ; π ],

where T is the first (random) time when the termination state is achieved.
State Value Function
◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.

V^π(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x ; π ].
State Value Function

Technical note: the expectations refer to all possible stochastic trajectories. A (non-stationary) policy π applied from state x_0 returns a trajectory (x_0, r_0, x_1, r_1, x_2, r_2, ...) where r_t = r(x_t, π_t(x_t)) and x_t ∼ p(·| x_{t−1}, a_{t−1} = π_{t−1}(x_{t−1})) are random realizations. The value function (discounted infinite horizon) is

V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x ; π ].
Optimal Value Function

Definition (Optimal policy and optimal value function)
The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π ∈ Π} V^π

in all the states x ∈ X, where Π is some policy set of interest.
The corresponding value function is the optimal value function V* = V^{π*}.

Remark: π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.
Example: the EC Student Dilemma

[Figure: a graph with states 1–7, actions Rest/Work on the edges, transition probabilities (0.5, 0.4, 0.6, 0.3, 0.7, 0.9, 0.1, 1, ...) and rewards r = 1, 0, −1, −10, 100, −1000, −10 attached to the transitions and terminal states.]
Example: the EC Student Dilemma
◮ Model: all the transitions are Markov; states x_5, x_6, x_7 are terminal.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards before achieving a terminal state.
Example: the EC Student Dilemma

[Figure: the same graph annotated with the computed values V_1 = 88.3, V_2 = 88.3, V_3 = 86.9, V_4 = 88.9, V_5 = −10, V_6 = 100, V_7 = −1000.]
Example: the EC Student Dilemma

V_7 = −1000
V_6 = 100
V_5 = −10
V_4 = −10 + 0.9 V_6 + 0.1 V_4 ≃ 88.9
V_3 = −1 + 0.5 V_4 + 0.5 V_3 ≃ 86.9
V_2 = 1 + 0.7 V_3 + 0.3 V_1
V_1 = max{ 0.5 V_2 + 0.5 V_1 , 0.5 V_3 + 0.5 V_1 }
⇒ V_1 = V_2 = 88.3
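These numbers can be reproduced by simply iterating the equations above until they stabilize (a fixed-point iteration). A small sketch, using the state indices of the slides:

```python
import numpy as np

# Values of the terminal states (indices 5, 6, 7 in the slides).
V = np.zeros(8)            # V[1..7]; V[0] unused
V[5], V[6], V[7] = -10.0, 100.0, -1000.0

for _ in range(200):       # iterate until the non-terminal values stabilize
    V[4] = -10 + 0.9 * V[6] + 0.1 * V[4]
    V[3] = -1 + 0.5 * V[4] + 0.5 * V[3]
    V[2] = 1 + 0.7 * V[3] + 0.3 * V[1]
    V[1] = max(0.5 * V[2] + 0.5 * V[1], 0.5 * V[3] + 0.5 * V[1])

print(np.round(V[1:5], 1))   # approximately [88.3, 88.3, 86.9, 88.9]
```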
State-Action Value Function

Definition
In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is

Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is

Q*(x, a) = max_π Q^π(x, a).
State-Action Value Function

The relationships between the V-function and the Q-function are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y)
V^π(x) = Q^π(x, π(x))
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y)
V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).
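These relationships translate directly into array code. The sketch below (again on the hypothetical FiniteMDP container) computes Q^π from a given V^π and the greedy policy obtained by maximizing Q over actions.

```python
import numpy as np

def q_from_v(mdp, V):
    """Q[x, a] = r(x, a) + gamma * sum_y p(y|x, a) V(y)."""
    # mdp.P has shape (A, N, N); the einsum sums over y and returns shape (N, A)
    return mdp.R + mdp.gamma * np.einsum('axy,y->xa', mdp.P, V)

def greedy(mdp, V):
    """Greedy policy and its one-step value: max_a Q(x, a)."""
    Q = q_from_v(mdp, V)
    return Q.argmax(axis=1), Q.max(axis=1)
```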
Bellman Equations for Discounted Infinite Horizon Problems
Question
Is there any more compact way to describe a value function? ⇒ Bellman equations!
The Bellman Equation

Proposition
For any stationary policy π = (π, π, ...), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
The Bellman Equation

Proof.
For any (stationary) policy π,

V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x ; π ]
       = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x ; π ]
       = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x ; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y ; π ]
       = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).  □
The Optimal Bellman Equation

Bellman’s Principle of Optimality [1]: “An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”
The Optimal Bellman Equation

Proposition
The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ],

and the optimal policy is

π*(x) = arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].
The Optimal Bellman Equation

Proof.
For any policy π = (a, π′) (possibly non-stationary),

V*(x) (a)= max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x ; π ]
      (b)= max_{(a, π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ]
      (c)= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ]
      (d)= max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].  □
The Bellman Operators

Notation. W.l.o.g. we consider a discrete state space with |X| = N, so that V^π ∈ R^N.

Definition
For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is

T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),

and the optimal Bellman operator (or dynamic programming operator) is

T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
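A sketch of the two operators as functions acting on vectors W ∈ R^N, under the same hypothetical FiniteMDP representation used above; this is an illustration, not the lecture's own code.

```python
import numpy as np

def bellman_op_pi(mdp, policy, W):
    """T^pi W(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)."""
    xs = np.arange(mdp.n_states)
    return mdp.R[xs, policy] + mdp.gamma * np.einsum('xy,y->x', mdp.P[policy, xs], W)

def bellman_op(mdp, W):
    """T W(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]."""
    Q = mdp.R + mdp.gamma * np.einsum('axy,y->xa', mdp.P, W)
    return Q.max(axis=1)
```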
The Bellman Operators

Proposition (Properties of the Bellman operators)
1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then

T^π W_1 ≤ T^π W_2,
T W_1 ≤ T W_2.

2. Offset: for any scalar c ∈ R,

T^π(W + c I_N) = T^π W + γ c I_N,
T(W + c I_N) = T W + γ c I_N.
The Bellman Operators

Proposition
3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,

||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,
||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: for any policy π,

V^π is the unique fixed point of T^π,
V* is the unique fixed point of T.

Furthermore, for any W ∈ R^N and any stationary policy π,

lim_{k→∞} (T^π)^k W = V^π,
lim_{k→∞} (T)^k W = V*.
The Bellman Operators

Proof.
The contraction property (3) holds since for any x ∈ X we have

|T W_1(x) − T W_2(x)|
= | max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W_2(y) ] |
(a)≤ max_a | [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − [ r(x, a) + γ Σ_y p(y|x, a) W_2(y) ] |
= γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)|
≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).  □
Exercise: Fixed Point

Revise the Banach fixed point theorem and prove the fixed point property of the Bellman operator.
Bellman Equations for Undiscounted Infinite Horizon Problems
Question
Is there any more compact way to describe a value function when we consider an infinite horizon with no discount? ⇒ Proper policies and Bellman equations!
The Undiscounted Infinite Horizon Setting

The value function is

V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x ; π ],

where T is the first random time when the agent achieves a terminal state.
Proper Policies

Definition
A stationary policy π is proper if ∃ n ∈ N such that ∀ x ∈ X the probability of achieving the terminal state x̄ after n steps is strictly positive. That is,

ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.
Bounded Value Function

Proposition
For any proper policy π with parameter ρ_π after n steps, the value function is bounded as

||V^π||_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}.
The Undiscounted Infinite Horizon Setting

Proof.
By definition of a proper policy,

P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) × P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π².

Then for any t ∈ N,

P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋},

which implies that eventually the terminal state x̄ is achieved with probability 1. Then

||V^π||_∞ = max_{x∈X} E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x ; π ]
          ≤ r_max Σ_{t>0} P(x_t ≠ x̄ | x_0 = x, π)
          ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}.  □
Bellman Operator

Assumption. There exists at least one proper policy, and for any non-proper policy π there exists at least one state x where V^π(x) = −∞ (cycles with only negative rewards).

Proposition ([2])
Under the previous assumption, the optimal value function is bounded, i.e., ||V*||_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N,

T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].

Furthermore, V* = lim_{k→∞} (T)^k W.
Bellman Operator

Proposition
Let all the policies π be proper. Then there exist a vector μ ∈ R^N with μ > 0 and a scalar β < 1 such that, ∀x, y ∈ X, ∀a ∈ A,

Σ_y p(y|x, a) μ(y) ≤ β μ(x).

Thus both operators T and T^π are contractions in the weighted norm L_{∞,μ}, that is,

||T W_1 − T W_2||_{∞,μ} ≤ β ||W_1 − W_2||_{∞,μ}.
Bellman Operator

Proof.
Let μ be the maximum (over policies) of the average time to the termination state. This can easily be cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, μ is finite and it is the solution to the dynamic programming equation

μ(x) = 1 + max_a Σ_y p(y|x, a) μ(y).

Then μ(x) ≥ 1 and for any a ∈ A, μ(x) ≥ 1 + Σ_y p(y|x, a) μ(y). Furthermore,

Σ_y p(y|x, a) μ(y) ≤ μ(x) − 1 ≤ β μ(x),

for

β = max_x (μ(x) − 1) / μ(x) < 1.
Bellman Operator

Proof (cont’d).
From this definition of μ and β we obtain the contraction property of T (similarly for T^π) in the norm L_{∞,μ}:

||T W_1 − T W_2||_{∞,μ} = max_x |T W_1(x) − T W_2(x)| / μ(x)
                        ≤ max_{x,a} Σ_y p(y|x, a) |W_1(y) − W_2(y)| / μ(x)
                        ≤ max_{x,a} ( Σ_y p(y|x, a) μ(y) / μ(x) ) ||W_1 − W_2||_{∞,μ}
                        ≤ β ||W_1 − W_2||_{∞,μ}.  □
Dynamic Programming
Question
How do we compute the value functions / solve an MDP? ⇒ Value/Policy Iteration algorithms!
System of Equations

The Bellman equation

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y)

is a linear system of equations with N unknowns and N linear constraints.

The optimal Bellman equation

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]

is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator).
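Because policy evaluation is a linear system, V^π can be computed in closed form by solving (I − γ P^π) V^π = r^π. A sketch under the same hypothetical FiniteMDP assumptions as the earlier snippets:

```python
import numpy as np

def policy_evaluation_exact(mdp, policy):
    """Solve the linear Bellman equation V = r_pi + gamma * P_pi V in closed form."""
    xs = np.arange(mdp.n_states)
    P_pi = mdp.P[policy, xs]          # P_pi[x, y] = p(y | x, pi(x))
    r_pi = mdp.R[xs, policy]          # r_pi[x]    = r(x, pi(x))
    I = np.eye(mdp.n_states)
    return np.linalg.solve(I - mdp.gamma * P_pi, r_pi)
```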
Value Iteration: the Idea
1. Let V_0 be any vector in R^N.
2. At each iteration k = 1, 2, ..., K
   ◮ Compute V_{k+1} = T V_k.
3. Return the greedy policy

π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
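Putting the pieces together, a compact value-iteration sketch built on the hypothetical helpers defined earlier (bellman_op, q_from_v); the sup-norm stopping rule is a common practical choice, not something prescribed by the slides.

```python
import numpy as np

def value_iteration(mdp, K=1000, tol=1e-8):
    """Iterate V_{k+1} = T V_k, then return the greedy policy w.r.t. the last iterate."""
    V = np.zeros(mdp.n_states)                    # V_0: any vector in R^N
    for _ in range(K):
        V_next = bellman_op(mdp, V)               # V_{k+1} = T V_k
        if np.max(np.abs(V_next - V)) < tol:      # stop when the update is tiny
            V = V_next
            break
        V = V_next
    policy = q_from_v(mdp, V).argmax(axis=1)      # greedy policy pi_K
    return policy, V
```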