Markov Property
"The future is independent of the past given the present"
Definition: A state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
The state captures all relevant information from the history.
Once the state is known, the history may be thrown away.
i.e. the state is a sufficient statistic of the future.
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is P_ss' = P[S_{t+1} = s' | S_t = s].
The state transition matrix P collects these probabilities from all states s to all successor states s', where each row of the matrix sums to 1.
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.
Definition: A Markov Process (or Markov Chain) is a tuple (S, P):
S is a (finite) set of states
P is a state transition probability matrix, P_ss' = P[S_{t+1} = s' | S_t = s]
Example: Student Markov Chain (transition graph)
[Figure: transition graph over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep, with transition probabilities (0.1-1.0) labelling the edges.]
Example: Student Markov Chain Episodes
Sample episodes for the Student Markov Chain starting from S_1 = C1 (episodes are sequences S_1, S_2, ..., S_T):
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
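The transition structure can be written down directly and used to generate episodes like the ones above. The probabilities below are read off the transition graph on the previous slide (a sketch, so the exact numbers should be checked against the figure).

```python
import random

# Transition probabilities as read off the Student Markov Chain graph (a sketch).
P = {
    "C1":    {"C2": 0.5, "FB": 0.5},
    "C2":    {"C3": 0.8, "Sleep": 0.2},
    "C3":    {"Pass": 0.6, "Pub": 0.4},
    "Pass":  {"Sleep": 1.0},
    "Pub":   {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":    {"FB": 0.9, "C1": 0.1},
    "Sleep": {},  # terminal state
}

def sample_episode(start="C1", max_steps=100):
    """Sample one episode S_1, S_2, ..., S_T from the chain."""
    state, episode = start, [start]
    for _ in range(max_steps):
        if not P[state]:          # terminal state reached
            break
        successors, probs = zip(*P[state].items())
        state = random.choices(successors, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```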
Example: Student Markov Chain Transition Matrix
[Figure: the same transition graph together with the corresponding state transition matrix P.]
Markov Decision Processes
• States: S
• Model: T(s, a, s') = P(s' | s, a)
• Actions: A(s), A
• Reward: R(s), R(s, a), R(s, a, s')
• Discount: γ
• Policy: π: s → a
• Utility/Value: sum of discounted rewards
• We seek an optimal policy that maximizes the expected total (discounted) reward
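A minimal container for these ingredients, just to fix notation; the class and field names below are illustrative, not an API from the slides.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Minimal finite-MDP container (a sketch to fix notation, not a library API).

    transitions[s][a] maps each successor state s' to T(s, a, s') = P(s' | s, a);
    rewards[s][a] is the expected immediate reward R(s, a).
    """
    states: list
    actions: dict          # actions[s] = list of actions available in s, A(s)
    transitions: dict      # transitions[s][a] = {s_prime: probability}
    rewards: dict          # rewards[s][a] = expected immediate reward
    gamma: float = 0.9     # discount factor

    def successors(self, s, a):
        return self.transitions[s][a].items()
```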
Example: Student MRP
[Figure: the Student Markov chain augmented with rewards: R = -2 in Class 1, Class 2 and Class 3, R = -1 in Facebook, R = +1 in Pub, R = +10 in Pass, R = 0 in Sleep.]
Goals, Returns and Rewards
• The agent's goal is to maximize the total amount of reward it receives in the long run, not just the immediate reward.
• In maze tasks, for example, the reward is typically -1 for every time step.
• Deciding how to associate rewards with states is part of modelling the problem.
If T is the final time step, the (undiscounted) return is G_t = R_{t+1} + R_{t+2} + ... + R_T.
Return
Definition: The return G_t is the total discounted reward from time-step t:
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}
The discount γ ∈ [0, 1] is the present value of future rewards.
The value of receiving reward R after k + 1 time-steps is γ^k R.
This values immediate reward above delayed reward:
γ close to 0 leads to "myopic" evaluation
γ close to 1 leads to "far-sighted" evaluation
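A small helper that computes the return of a finite reward sequence, accumulating backwards so each reward picks up the right power of γ (a sketch for illustration):

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```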
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
Value Function
The value function v(s) gives the long-term value of state s.
Definition: The state value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[G_t | S_t = s]
Example: Student MRP Returns
Sample returns for the Student MRP, starting from S_1 = C1 with γ = 1/2:
G_1 = R_2 + γ R_3 + ... + γ^{T-2} R_T
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
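As a check on the first episode, using discounted_return from the sketch above and the rewards read off the Student MRP figure (-2 for each class, +10 for Pass, assumed to be collected on leaving the state):

```python
# Episode C1 C2 C3 Pass Sleep with gamma = 1/2:
# G_1 = -2 + (1/2)(-2) + (1/4)(-2) + (1/8)(10) = -2.25
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2.25
```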
Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward R_{t+1}
discounted value of the successor state, γ v(S_{t+1})
v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
     = E[R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) | S_t = s]
     = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
Bellman Equation for MRPs (2)
Taking the expectation over the next state gives the one-step look-ahead form:
v(s) = R_s + γ Σ_{s' ∈ S} P_ss' v(s')
where R_s = E[R_{t+1} | S_t = s] is the expected immediate reward in state s.
Example: Bellman Equation for Student MRP (γ = 1)
Check at Class 3: 4.3 = -2 + 0.6 * 10 + 0.4 * 0.8
[Figure: the Student MRP annotated with state values v(s): Facebook -23, Class 1 -13, Class 2 1.5, Class 3 4.3, Pub 0.8, Pass 10, Sleep 0.]
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices:
v = R + γ P v
where v is a column vector with one entry per state.
Solving the Bellman Equation
The Bellman equation is a linear equation. It can be solved directly:
v = R + γ P v
(I - γ P) v = R
v = (I - γ P)^{-1} R
Computational complexity is O(n^3) for n states.
Direct solution is only possible for small MRPs.
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
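As a quick illustration, the direct solution takes a few lines of NumPy. The transition probabilities and rewards below are read off the Student MRP figures above (a sketch; the numbers should be checked against the slides), and γ = 0.9 is chosen so that I - γP stays invertible despite the absorbing Sleep state.

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # Class 1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # Class 2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # Class 3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # Facebook
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])

gamma = 0.9
v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)  # v = (I - gamma P)^{-1} R
print(dict(zip(states, np.round(v, 2))))
```

Using np.linalg.solve rather than forming the inverse explicitly is the standard numerically stable choice for this linear system.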
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition: A Markov Decision Process is a tuple (S, A, P, R, γ):
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix, P^a_ss' = P[S_{t+1} = s' | S_t = s, A_t = a]
R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MDP
[Figure: the Student MDP. Actions and rewards: Facebook (R = -1), Quit (R = 0), Sleep (R = 0), Study (R = -2, -2 and +10 from the three classes) and Pub (R = +1), where Pub has a stochastic outcome with probabilities 0.2, 0.4, 0.4.]
Policies
Definition: A policy π is a distribution over actions given states,
π(a | s) = P[A_t = a | S_t = s]
A policy fully defines the behaviour of an agent.
MDP policies depend on the current state (not the history),
i.e. policies are stationary (time-independent): A_t ~ π(· | S_t), ∀ t > 0
Policy’s and Value functions 295, Winter 2018 63
Gridworld Example: Prediction
Actions: up, down, left, right. Rewards are 0, except that actions taking the agent off the grid leave its location unchanged and give reward -1; from A every action moves to A' with reward +10, and from B every action moves to B' with reward +5. Policy: actions are chosen uniformly at random.
What is the value function for the uniform random policy, with γ = 0.9? It is obtained by solving Eq. 3.14 (the Bellman equation).
[Figure 3.3: (a) the 5x5 gridworld with special states A, A', B, B'; (b) the state-value function for the uniform random policy:
  3.3   8.8   4.4   5.3   1.5
  1.5   3.0   2.3   1.9   0.5
  0.1   0.7   0.7   0.4  -0.4
 -1.0  -0.4  -0.4  -0.6  -1.2
 -1.9  -1.3  -1.2  -1.4  -2.0 ]
Exercise: show that Eq. 3.14 holds for each state in figure (b).
Value Functions, Q-Functions
Definition: The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π:
v_π(s) = E_π[G_t | S_t = s]
Definition: The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:
v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
The action-value function can similarly be decomposed:
q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
Expressing these functions recursively translates into one-step look-ahead equations, given on the next slides.
Bellman Expectation Equation for V^π
Averaging over the actions chosen by the policy:
v_π(s) = Σ_{a ∈ A} π(a | s) q_π(s, a)
Bellman Expectation Equation for Q^π
Looking ahead one step over the environment's dynamics:
q_π(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_ss' v_π(s')
Bellman Expectation Equation for v_π (2)
Combining the two gives a recursion in v_π alone:
v_π(s) = Σ_{a ∈ A} π(a | s) ( R^a_s + γ Σ_{s' ∈ S} P^a_ss' v_π(s') )
Bellman Expectation Equation for q_π (2)
Similarly, a recursion in q_π alone:
q_π(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_ss' Σ_{a' ∈ A} π(a' | s') q_π(s', a')
Optimal Policies and Optimal Value Functions
Definition: The optimal state-value function v_*(s) is the maximum value function over all policies:
v_*(s) = max_π v_π(s)
The optimal action-value function q_*(s, a) is the maximum action-value function over all policies:
q_*(s, a) = max_π q_π(s, a)
The optimal value function specifies the best possible performance in the MDP.
An MDP is "solved" when we know the optimal value function.
Optimal Value Function for Student MDP
v_*(s) for γ = 1
[Figure: the Student MDP annotated with optimal state values: Facebook 6, Class 1 6, Class 2 8, Class 3 10, Sleep 0.]
Optimal Action-Value Function for Student MDP
q_*(s, a) for γ = 1
[Figure: the Student MDP annotated with optimal action values, e.g. q*(Class 1, Study) = 6, q*(Class 1, Facebook) = 5, q*(Class 2, Study) = 8, q*(Class 2, Sleep) = 0, q*(Class 3, Study) = 10, q*(Class 3, Pub) = 8.4, q*(Facebook, Quit) = 6, q*(Facebook, Facebook) = 5.]
Optimal Policy
Define a partial ordering over policies: π ≥ π' if v_π(s) ≥ v_π'(s), ∀ s
Theorem: For any Markov Decision Process
There exists an optimal policy π_* that is better than or equal to all other policies, π_* ≥ π, ∀ π
All optimal policies achieve the optimal value function, v_{π_*}(s) = v_*(s)
All optimal policies achieve the optimal action-value function, q_{π_*}(s, a) = q_*(s, a)
Finding an Optimal Policy
An optimal policy can be found by maximising over q_*(s, a):
π_*(a | s) = 1 if a = argmax_{a ∈ A} q_*(s, a), and 0 otherwise
There is always a deterministic optimal policy for any MDP.
If we know q_*(s, a), we immediately have the optimal policy.
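In code, reading off a greedy policy from q* is a one-liner. The nested-dict layout q[s][a] and the example numbers (taken from the Student MDP slide above) are purely illustrative.

```python
def greedy_policy_from_q(q_star):
    """Deterministic optimal policy: pick argmax_a q*(s, a) in every state."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in q_star.items()}

# Illustrative values from the Student MDP slide above (assumed dict layout).
q_star = {"Class3": {"Study": 10.0, "Pub": 8.4},
          "Class2": {"Study": 8.0, "Sleep": 0.0}}
print(greedy_policy_from_q(q_star))  # {'Class3': 'Study', 'Class2': 'Study'}
```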
Bellman Equation for V* and Q*
The optimal value functions are related by a max over actions and a one-step look-ahead:
v_*(s) = max_{a ∈ A} ( R^a_s + γ Σ_{s' ∈ S} P^a_ss' v_*(s') )
q_*(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_ss' max_{a' ∈ A} q_*(s', a')
Example: Bellman Optimality Equation in Student MDP
Check at Class 1: 6 = max{-2 + 8, -1 + 6}
[Figure: the Student MDP annotated with the optimal values v_*(s) = 6, 8, 10 for the three classes, 6 for Facebook and 0 for Sleep.]
Gridworld Example: Control
What is the optimal value function over all possible policies? What is the optimal policy?
[Figure 3.6: (a) the gridworld, (b) the optimal state-value function v*:
 22.0  24.4  22.0  19.4  17.5
 19.8  22.0  19.8  17.8  16.0
 17.8  19.8  17.8  16.0  14.4
 16.0  17.8  16.0  14.4  13.0
 14.4  16.0  14.4  13.0  11.7
(c) an optimal policy π*.]
Solving the Bellman Optimality Equation
The Bellman optimality equation is non-linear.
No closed-form solution (in general).
Many iterative solution methods:
Value Iteration
Policy Iteration
Q-learning
Sarsa
Planning by Dynamic Programming
Sutton & Barto, Chapter 4
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP.
It is used for planning in an MDP.
For prediction:
Input: MDP (S, A, P, R, γ) and policy π, or MRP (S, P^π, R^π, γ)
Output: value function v_π
Or for control:
Input: MDP (S, A, P, R, γ)
Output: optimal value function v_* and optimal policy π_*
Policy Evaluation (Prediction)
Problem: evaluate a given policy π
Solution: iterative application of the Bellman expectation backup
v_1 → v_2 → ... → v_π
Using synchronous backups:
At each iteration k + 1, for all states s ∈ S, update v_{k+1}(s) from v_k(s'), where s' is a successor state of s.
We will discuss asynchronous backups later.
Convergence to v_π will be proven at the end of the lecture.
Iterative Policy Evaluation
The Bellman expectation equations are simultaneous linear equations in |S| unknowns and can be solved exactly. In practice, an iterative procedure run until a fixed point is reached is often more effective: iterative policy evaluation (sketched below).
Iterative Policy Evaluation
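A minimal sketch of iterative policy evaluation for a finite MDP. The array layout (P[a] an |S| x |S| transition matrix, R[a] a reward vector, policy an |S| x |A| matrix of probabilities π(a | s)) is an assumption made for this sketch, not notation from the slides.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman expectation backup until v stops changing.

    P[a][s, s'] = P(s' | s, a), R[a][s] = expected immediate reward for (s, a),
    policy[s, a] = pi(a | s).
    """
    n_states, n_actions = policy.shape
    v = np.zeros(n_states)
    while True:
        # Synchronous backup:
        # v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        v_new = sum(policy[:, a] * (R[a] + gamma * P[a] @ v) for a in range(n_actions))
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```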
Evaluating a Random Policy in the Small Gridworld
Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14
One terminal state (shown twice, as the shaded squares)
Actions leading out of the grid leave the state unchanged
Reward is -1 on every transition until the terminal state is reached
Agent follows the uniform random policy π(n | ·) = π(e | ·) = π(s | ·) = π(w | ·) = 0.25
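The numbers on the next two slides can be reproduced with the policy_evaluation sketch above. The grid encoding below (cells numbered 0-15 row by row, with 0 and 15 as the two terminal squares) is an assumption about how the figure lays out the states.

```python
import numpy as np

n = 16                                          # 4x4 grid, cells 0..15
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
terminal = {0, 15}

P = np.zeros((4, n, n))
R = np.zeros((4, n))
for s in range(n):
    row, col = divmod(s, 4)
    for a, (dr, dc) in enumerate(moves):
        if s in terminal:
            P[a, s, s] = 1.0                    # terminal: absorbing, zero reward
            continue
        r2, c2 = row + dr, col + dc
        s2 = s if not (0 <= r2 < 4 and 0 <= c2 < 4) else r2 * 4 + c2
        P[a, s, s2] = 1.0
        R[a, s] = -1.0                          # -1 per step until termination

uniform = np.full((n, 4), 0.25)                 # uniform random policy
v = policy_evaluation(P, R, uniform, gamma=1.0) # reuses the sketch above
print(np.round(v.reshape(4, 4), 1))             # k = infinity values: 0, -14, -20, -22, ...
```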
Iterative Policy Evaluation in Small Gridworld
v_k for the random policy (the greedy policy w.r.t. v_k is shown alongside in the original figure):
k = 0 (greedy policy: random):
  0.0  0.0  0.0  0.0
  0.0  0.0  0.0  0.0
  0.0  0.0  0.0  0.0
  0.0  0.0  0.0  0.0
k = 1:
  0.0 -1.0 -1.0 -1.0
 -1.0 -1.0 -1.0 -1.0
 -1.0 -1.0 -1.0 -1.0
 -1.0 -1.0 -1.0  0.0
k = 2:
  0.0 -1.7 -2.0 -2.0
 -1.7 -2.0 -2.0 -2.0
 -2.0 -2.0 -2.0 -1.7
 -2.0 -2.0 -1.7  0.0
Iterative Policy Evaluation in Small Gridworld (2)
k = 3:
  0.0 -2.4 -2.9 -3.0
 -2.4 -2.9 -3.0 -2.9
 -2.9 -3.0 -2.9 -2.4
 -3.0 -2.9 -2.4  0.0
k = 10 (greedy policy: optimal):
  0.0 -6.1 -8.4 -9.0
 -6.1 -7.7 -8.4 -8.4
 -8.4 -8.4 -7.7 -6.1
 -9.0 -8.4 -6.1  0.0
k = ∞:
  0.0 -14. -20. -22.
 -14. -18. -20. -20.
 -20. -20. -18. -14.
 -22. -20. -14.  0.0
Policy Improvement
Given a policy π:
Evaluate the policy: v_π(s) = E[R_{t+1} + γ R_{t+2} + ... | S_t = s]
Improve the policy by acting greedily with respect to v_π: π' = greedy(v_π)
In the Small Gridworld the improved policy was already optimal, π' = π_*.
In general, more iterations of improvement / evaluation are needed.
But this process of policy iteration always converges to π_*.
Policy Iteration
Policy Iteration
Policy evaluation: estimate v_π (iterative policy evaluation)
Policy improvement: generate π' ≥ π (greedy policy improvement)
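A compact sketch of the loop, reusing policy_evaluation from earlier; the array layout is the same assumed one.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Alternate full policy evaluation with greedy policy improvement
    until the greedy policy stops changing."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start from uniform random
    while True:
        v = policy_evaluation(P, R, policy, gamma, theta)
        # q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) v(s')
        q = np.stack([R[a] + gamma * P[a] @ v for a in range(n_actions)], axis=1)
        greedy = np.eye(n_actions)[q.argmax(axis=1)]            # deterministic greedy policy
        if np.array_equal(greedy, policy):
            return policy, v
        policy = greedy
```

On the small-gridworld model built above (with gamma = 1.0), a single greedy improvement of the random policy already yields an optimal policy, consistent with the remark on the Policy Improvement slide.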
Policy Improvement
Policy Improvement (2)
If improvements stop,
q_π(s, π'(s)) = max_{a ∈ A} q_π(s, a) = q_π(s, π(s)) = v_π(s)
then the Bellman optimality equation has been satisfied:
v_π(s) = max_{a ∈ A} q_π(s, a)
Therefore v_π(s) = v_*(s) for all s ∈ S, so π is an optimal policy.
Modified Policy Iteration
Does policy evaluation need to converge to v_π?
Or should we introduce a stopping condition, e.g. ε-convergence of the value function?
Or simply stop after k iterations of iterative policy evaluation?
For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy.
Why not update the policy every iteration, i.e. stop after k = 1?
This is equivalent to value iteration (next section).
Generalised Policy Iteration
Policy evaluation: estimate v_π (any policy evaluation algorithm)
Policy improvement: generate π' ≥ π (any policy improvement algorithm)
Principle of Optimality
Any optimal policy can be subdivided into two components:
an optimal first action A_*
followed by an optimal policy from the successor state S'
Theorem (Principle of Optimality): A policy π(a | s) achieves the optimal value from state s, v_π(s) = v_*(s), if and only if, for any state s' reachable from s, π achieves the optimal value from state s', v_π(s') = v_*(s').
Deterministic Value Iteration
Value Iteration
Example: Shortest Path
[Figure: a 4x4 gridworld with goal state g in the top-left corner and reward -1 per step. Successive value-iteration estimates V_1, ..., V_7 spread out from the goal:
 V_1: all zeros
 V_2:  0 -1 -1 -1 / -1 -1 -1 -1 / -1 -1 -1 -1 / -1 -1 -1 -1
 V_3:  0 -1 -2 -2 / -1 -2 -2 -2 / -2 -2 -2 -2 / -2 -2 -2 -2
 ...
 V_7:  0 -1 -2 -3 / -1 -2 -3 -4 / -2 -3 -4 -5 / -3 -4 -5 -6 (the negated shortest-path distances).]
Value Iteration
Problem: find the optimal policy π_*
Solution: iterative application of the Bellman optimality backup
v_1 → v_2 → ... → v_*
Using synchronous backups:
At each iteration k + 1, for all states s ∈ S, update v_{k+1}(s) from v_k(s').
Convergence to v_* will be proven later.
Unlike policy iteration, there is no explicit policy.
Intermediate value functions may not correspond to any policy.
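A sketch of the loop, in the same assumed array layout as the earlier policy-evaluation code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality backup
    v_{k+1}(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
    until convergence, then read off a greedy (deterministic) policy."""
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)
    while True:
        q = np.stack([R[a] + gamma * P[a] @ v for a in range(n_actions)], axis=1)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, q.argmax(axis=1)      # optimal values and a greedy policy
        v = v_new
```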
Value Iteration (2)
Asynchronous Dynamic Programming
The DP methods described so far used synchronous backups, i.e. all states are backed up in parallel.
Asynchronous DP backs up states individually, in any order.
For each selected state, apply the appropriate backup.
Can significantly reduce computation.
Guaranteed to converge if all states continue to be selected.
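The simplest asynchronous variant is in-place value iteration: each backup immediately overwrites v(s), so later backups in the same sweep already see the new values. A sketch, again using the assumed array layout:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, sweeps=100):
    """Back up states one at a time, immediately reusing the latest values."""
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for s in range(n_states):               # any state ordering is allowed
            v[s] = max(R[a][s] + gamma * P[a][s] @ v for a in range(n_actions))
    return v
```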