
Reinforcement Learning, or: Learning and Planning with Markov Decision Processes. 295 Seminar, Winter 2018, Rina Dechter. Slides follow David Silver's lectures and Sutton & Barto's book. Goals: to learn together the basics of RL; some lectures and classic ...


  1. Lecture 2: Markov Decision Processes. Markov Property. "The future is independent of the past given the present." Definition: A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]. The state captures all relevant information from the history; once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.

  2. Lecture 2: Markov Decision Processes. State Transition Matrix. For a Markov state s and successor state s', the state transition probability is P_{ss'} = P[S_{t+1} = s' | S_t = s]. The state transition matrix P collects these probabilities from all states s (rows) to all successor states s' (columns), where each row of the matrix sums to 1.
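A minimal sketch (not from the slides) of how such a transition matrix can be represented in code; the two-state chain and its probabilities are made up for illustration:

```python
import numpy as np

# Hypothetical two-state chain; P[s, s'] = P[S_{t+1} = s' | S_t = s]
states = ["Sunny", "Rainy"]
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# Each row of the matrix must sum to 1 (it is a probability distribution)
assert np.allclose(P.sum(axis=1), 1.0)

# Distribution over next states, given that the current state is Sunny
next_dist = P[states.index("Sunny")]
print(dict(zip(states, next_dist.tolist())))   # {'Sunny': 0.8, 'Rainy': 0.2}
```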

  3. Lecture 2: Markov Decision Processes. Markov Process (Markov Chain). A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property. Definition: A Markov process (or Markov chain) is a tuple (S, P), where S is a (finite) set of states and P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s].

  4. Lecture 2: Markov Decision Processes. Example: Student Markov Chain (transition graph). [Figure: transition graph over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep, with transition probabilities 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9 and 1.0 on its edges.]

  5. Lecture 2: Markov Decision Processes. Example: Student Markov Chain Episodes. Sample episodes for the Student Markov chain starting from S_1 = C1, i.e. sequences S_1, S_2, ..., S_T: C1 C2 C3 Pass Sleep; C1 FB FB C1 C2 Sleep; C1 C2 C3 Pub C2 C3 Pass Sleep; C1 FB FB C1 C2 C3 Pub C1 FB FB Pub FB C1 C2 C3 Pub C2 Sleep.
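As an illustration, here is a small sketch of sampling such episodes. The transition probabilities below are assumptions read off the (partially garbled) transition graph of the previous slides, so treat them as illustrative rather than authoritative:

```python
import random

P = {
    "C1":    [("C2", 0.5), ("FB", 0.5)],
    "C2":    [("C3", 0.8), ("Sleep", 0.2)],
    "C3":    [("Pass", 0.6), ("Pub", 0.4)],
    "Pass":  [("Sleep", 1.0)],
    "Pub":   [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB":    [("FB", 0.9), ("C1", 0.1)],
    "Sleep": [],                               # terminal state
}

def sample_episode(start="C1", rng=random.Random(0)):
    """Follow the chain from `start` until the terminal state (fixed seed for reproducibility)."""
    episode, s = [start], start
    while P[s]:
        succs, probs = zip(*P[s])
        s = rng.choices(succs, weights=probs)[0]
        episode.append(s)
    return episode

for _ in range(3):
    print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```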

  6. Lecture 2: Markov Decision Processes. Example: Student Markov Chain Transition Matrix. [Figure: the transition matrix P of the Student Markov chain, with rows and columns indexed by C1, C2, C3, Pass, Pub, Facebook and Sleep and entries equal to the edge probabilities of the transition graph.]

  7. Markov Decision Processes
     • States: S
     • Model: T(s, a, s') = P(s' | s, a)
     • Actions: A(s), A
     • Reward: R(s), R(s, a), R(s, a, s')
     • Discount: γ
     • Policy: π: s → a
     • Utility/Value: sum of discounted rewards
     We seek an optimal policy that maximizes the expected total (discounted) reward.
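For concreteness, the ingredients listed above could be collected in a small container like the following sketch; the field names and types are illustrative choices, not anything prescribed by the slides:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                           # S
    actions: Dict[str, List[str]]                               # A(s)
    transition: Dict[Tuple[str, str], List[Tuple[str, float]]]  # (s, a) -> [(s', P(s'|s,a)), ...]
    reward: Callable[[str, str, str], float]                    # R(s, a, s')
    gamma: float                                                # discount

# A deterministic policy simply maps each state to an action: pi: s -> a
Policy = Dict[str, str]
```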

  8. Lecture 2: Markov Decision Processes. Example: Student MRP. [Figure: the Student Markov chain annotated with rewards: R = -2 in Class 1, Class 2 and Class 3, R = -1 in Facebook, R = +1 in Pub, R = +10 in Pass, R = 0 in Sleep.]

  9. Goals, Returns and Rewards. The agent's goal is to maximize the total amount of reward it receives in the long run, not the immediate reward. In maze problems, for example, the reward is typically -1 for every time step. Deciding how to associate rewards with states is part of modelling the problem. If T is the final time step, the return is G_t = R_{t+1} + R_{t+2} + ... + R_T.

  10. Lecture 2: Markov Decision Processes. Return. Definition: The return G_t is the total discounted reward from time-step t, G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}. The discount γ ∈ [0, 1] is the present value of future rewards: the value of receiving reward R after k + 1 time-steps is γ^k R. This values immediate reward above delayed reward; γ close to 0 leads to "myopic" evaluation, γ close to 1 to "far-sighted" evaluation.

  11. Lecture 2: Markov Decision Processes. Why discount? Most Markov reward and decision processes are discounted. Why? It is mathematically convenient to discount rewards; it avoids infinite returns in cyclic Markov processes; uncertainty about the future may not be fully represented; if the reward is financial, immediate rewards may earn more interest than delayed rewards; and animal/human behaviour shows a preference for immediate reward. It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

  12. Lecture 2: Markov Decision Processes. Value Function. The value function v(s) gives the long-term value of state s. Definition: The state value function v(s) of an MRP is the expected return starting from state s, v(s) = E[G_t | S_t = s].

  13. Lecture 2: Markov Decision Processes. Example: Student MRP Returns. Sample returns for the Student MRP starting from S_1 = C1 with γ = 1/2, G_1 = R_2 + γ R_3 + ... + γ^{T-2} R_T, for the episodes: C1 C2 C3 Pass Sleep; C1 FB FB C1 C2 Sleep; C1 C2 C3 Pub C2 C3 Pass Sleep; C1 FB FB C1 C2 C3 Pub C1 ... FB FB FB C1 C2 C3 Pub C2 Sleep.
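A short sketch of this computation: it evaluates G_1 with γ = 1/2 for two of the sampled episodes, using per-state rewards assumed from the Student MRP figure (R = -2 in the classes, -1 in Facebook, +1 in Pub, +10 in Pass, 0 in Sleep):

```python
R = {"C1": -2, "C2": -2, "C3": -2, "FB": -1, "Pub": 1, "Pass": 10, "Sleep": 0}
gamma = 0.5

def discounted_return(episode):
    # G_1 = R_2 + gamma*R_3 + ...: the reward for leaving the k-th visited state
    # is discounted by gamma**k; the terminal state contributes nothing.
    return sum(gamma**k * R[s] for k, s in enumerate(episode[:-1]))

print(discounted_return(["C1", "C2", "C3", "Pass", "Sleep"]))       # -2.25
print(discounted_return(["C1", "FB", "FB", "C1", "C2", "Sleep"]))   # -3.125
```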

  14. Lecture 2: Markov Decision Processes. Bellman Equation for MRPs. The value function can be decomposed into two parts: the immediate reward R_{t+1} and the discounted value of the successor state, γ v(S_{t+1}):
     v(s) = E[G_t | S_t = s]
          = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
          = E[R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) | S_t = s]
          = E[R_{t+1} + γ G_{t+1} | S_t = s]
          = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

  15. Lecture 2: Markov Decision Processes. Bellman Equation for MRPs (2). Written as a one-step look-ahead over successor states: v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s').

  16. Lecture 2: Markov Decision Processes. Example: Bellman Equation for Student MRP. Spot check at Class 3: 4.3 = -2 + 0.6*10 + 0.4*0.8. [Figure: the Student MRP annotated with its state values, approximately -23 (Facebook), -13 (Class 1), 1.5 (Class 2), 4.3 (Class 3), 0.8 (Pub), 10 (Pass), 0 (Sleep).]

  17. Lecture 2: Markov Decision Processes. Bellman Equation in Matrix Form. The Bellman equation can be expressed concisely using matrices, v = R + γ P v, where v is a column vector with one entry per state.

  18. Lecture 2: Markov Decision Processes. Solving the Bellman Equation. The Bellman equation is a linear equation and can be solved directly: v = R + γ P v, so (I - γ P) v = R and v = (I - γ P)^{-1} R. The computational complexity is O(n^3) for n states, so the direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g. dynamic programming, Monte-Carlo evaluation, and temporal-difference learning.
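A minimal sketch of this direct solution for the Student MRP, using transition probabilities and rewards assumed from the earlier figures (the terminal Sleep state is omitted since its value is 0); the cubic cost is hidden inside the linear solve:

```python
import numpy as np

states = ["FB", "C1", "C2", "C3", "Pub", "Pass"]
#               FB   C1   C2   C3   Pub  Pass
P = np.array([[0.9, 0.1, 0.0, 0.0, 0.0, 0.0],   # Facebook
              [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],   # Class 1
              [0.0, 0.0, 0.0, 0.8, 0.0, 0.0],   # Class 2 (0.2 goes to Sleep)
              [0.0, 0.0, 0.0, 0.0, 0.4, 0.6],   # Class 3
              [0.0, 0.2, 0.4, 0.4, 0.0, 0.0],   # Pub
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])  # Pass (1.0 goes to Sleep)
R = np.array([-1, -2, -2, -2, 1, 10], dtype=float)
gamma = 1.0

# v = (I - gamma P)^{-1} R, computed via a linear solve
v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
for s, val in zip(states, v):
    print(f"v({s}) = {val:6.1f}")
# Roughly matches the values in the Student MRP figure:
# v(FB) ~ -22.5, v(C1) ~ -12.5, v(C2) ~ 1.5, v(C3) ~ 4.3, v(Pub) ~ 0.8, v(Pass) = 10
```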

  19. Lecture 2: Markov Decision Processes. Markov Decision Process. A Markov decision process (MDP) is a Markov reward process with decisions: a tuple (S, A, P, R, γ), where A is a finite set of actions, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a] is the transition model, R is the reward function, and γ is the discount factor.

  20. Lecture 2: Markov Decision Processes. Example: Student MDP. [Figure: the student example recast as an MDP. Action rewards: Facebook (R = -1), Quit (R = 0), Sleep (R = 0), Study (R = -2, R = -2, R = +10), and Pub (R = +1, with stochastic outcomes 0.2, 0.4, 0.4).]

  21. Lecture 2: Markov Decision Processes. Policies and Value Functions (1). Definition: A policy π is a distribution over actions given states, π(a | s) = P[A_t = a | S_t = s]. A policy fully defines the behaviour of an agent. MDP policies depend on the current state (not the history), i.e. policies are stationary (time-independent): A_t ~ π(· | S_t) for all t > 0.

  22. Policies and Value Functions.

  23. Lecture 1: Introduction to Reinforcement Learning. Gridworld Example: Prediction. Actions: up, down, left, right. Rewards are 0, except that actions leading off the grid receive reward -1, moves from A go to A' with reward +10, and moves from B go to B' with reward +5. Policy: actions are chosen uniformly at random. What is the value function for the uniform random policy? With γ = 0.9 it can be solved using Eq. 3.14, giving the values of Figure 3.3(b):
      3.3  8.8  4.4  5.3  1.5
      1.5  3.0  2.3  1.9  0.5
      0.1  0.7  0.7  0.4 -0.4
     -1.0 -0.4 -0.4 -0.6 -1.2
     -1.9 -1.3 -1.2 -1.4 -2.0
     Exercise: show that Eq. 3.14 holds for each state in Figure 3.3(b).
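The sketch below sets this prediction problem up as a linear system and solves it. The positions of A, A', B, B' and the dynamics follow the textbook description of the example and should be treated as assumptions if they differ from the slide:

```python
import numpy as np

N, gamma = 5, 0.9
A, A_prime, B, B_prime = (0, 1), (4, 1), (0, 3), (2, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # up, down, left, right

def step(s, a):
    """Return (next_state, reward) for taking action a in state s."""
    if s == A:
        return A_prime, 10.0                           # any action in A jumps to A'
    if s == B:
        return B_prime, 5.0                            # any action in B jumps to B'
    nxt = (s[0] + a[0], s[1] + a[1])
    if not (0 <= nxt[0] < N and 0 <= nxt[1] < N):
        return s, -1.0                                 # off the grid: stay put, reward -1
    return nxt, 0.0

states = [(i, j) for i in range(N) for j in range(N)]
idx = {s: k for k, s in enumerate(states)}
P, r = np.zeros((N * N, N * N)), np.zeros(N * N)
for s in states:
    for a in actions:                                  # uniform random policy: probability 1/4 each
        nxt, reward = step(s, a)
        P[idx[s], idx[nxt]] += 0.25
        r[idx[s]] += 0.25 * reward

v = np.linalg.solve(np.eye(N * N) - gamma * P, r).reshape(N, N)
print(v.round(1))   # the first row should come out close to [ 3.3  8.8  4.4  5.3  1.5]
```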

  24. Lecture 2: Markov Decision Processes. Value Functions and Q-Functions. Definition: The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π, v_π(s) = E_π[G_t | S_t = s]. Definition: The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π, q_π(s, a) = E_π[G_t | S_t = s, A_t = a].

  25. Lecture 2: Markov Decision Processes. Bellman Expectation Equation. The state-value function can again be decomposed into the immediate reward plus the discounted value of the successor state, v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]. The action-value function can similarly be decomposed, q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]. Expressed recursively, these translate into one-step look-aheads.
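As a sketch, the two one-step look-aheads can be written directly as code; the dictionary-based `pi`, `R` and `P` arguments are an assumed interface chosen only for illustration:

```python
# q_pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * v_pi(s')
def q_value(s, a, v, R, P, gamma):
    return R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
def state_value(s, v, pi, R, P, gamma):
    return sum(prob * q_value(s, a, v, R, P, gamma) for a, prob in pi[s].items())
```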

  26. Lecture 2: Markov Decision Processes. Bellman Expectation Equation for V^π. Averaging over the actions chosen by the policy: v_π(s) = Σ_{a ∈ A} π(a | s) q_π(s, a).

  27. Lecture 2: Markov Decision Processes. Bellman Expectation Equation for Q^π. Looking one step ahead through the environment dynamics: q_π(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_π(s').

  28. Lecture 2: Markov Decision Processes. Bellman Expectation Equation for v_π (2). Combining the two: v_π(s) = Σ_{a ∈ A} π(a | s) ( R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_π(s') ).

  29. Lecture 2: Markov Decision Processes. Bellman Expectation Equation for q_π (2). Similarly: q_π(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} Σ_{a' ∈ A} π(a' | s') q_π(s', a').

  30. Lecture 2: Markov Decision Processes. Optimal Policies and Optimal Value Functions. Definition: The optimal state-value function v_*(s) is the maximum value function over all policies, v_*(s) = max_π v_π(s). The optimal action-value function q_*(s, a) is the maximum action-value function over all policies, q_*(s, a) = max_π q_π(s, a). The optimal value function specifies the best possible performance in the MDP; an MDP is "solved" when we know the optimal value function.

  31. Lecture 2: Markov Decision Processes. Optimal Value Function for Student MDP. [Figure: the Student MDP annotated with v_*(s) for γ = 1; the optimal values shown are 0, 6, 6, 8 and 10.]

  32. Lecture 2: Markov Decision Processes. Optimal Action-Value Function for Student MDP. [Figure: the Student MDP annotated with q_*(s, a) for γ = 1; the optimal action values shown include q_* = 0, 5, 5, 6, 6, 8, 8.4 and 10 on the corresponding action edges.]

  33. Lecture 2: Markov Decision Processes. Optimal Policy. Define a partial ordering over policies: π ≥ π' if v_π(s) ≥ v_π'(s) for all s. Theorem: For any Markov decision process, there exists an optimal policy π_* that is better than or equal to all other policies, π_* ≥ π for all π; all optimal policies achieve the optimal value function, v_{π_*}(s) = v_*(s); and all optimal policies achieve the optimal action-value function, q_{π_*}(s, a) = q_*(s, a).

  34. Lecture 2: Markov Decision Processes. Finding an Optimal Policy. An optimal policy can be found by maximising over q_*(s, a): choose π_*(a | s) = 1 if a = argmax_{a ∈ A} q_*(s, a), and 0 otherwise. There is always a deterministic optimal policy for any MDP, so if we know q_*(s, a) we immediately have the optimal policy.
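A tiny sketch of this extraction step; the nested-dictionary layout `q_star[s][a]` is an assumption made for illustration:

```python
def greedy_policy(q_star):
    # In every state, pick an action that maximises q*(s, a)
    return {s: max(q_s, key=q_s.get) for s, q_s in q_star.items()}

# e.g. greedy_policy({"C3": {"Study": 10.0, "Pub": 8.4}}) -> {"C3": "Study"}
```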

  35. Bellman Equations for V* and Q*. The optimal value functions satisfy the Bellman optimality equations: v_*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v_*(s') ) and q_*(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s', a').

  36. Lecture 2: Markov Decision Processes. Example: Bellman Optimality Equation in the Student MDP. At Class 1 the optimal value satisfies 6 = max{-2 + 8, -1 + 6} (Study vs. Facebook). [Figure: the Student MDP annotated with optimal values 0, 6, 6, 8 and 10 and the rewards from the earlier Student MDP figure.]

  37. Lecture 1: Introduction to Reinforcement Learning. Gridworld Example: Control. What is the optimal value function over all possible policies? What is the optimal policy? Figure 3.6 shows (a) the gridworld, (b) the optimal value function v_*, and (c) an optimal policy π_*. The optimal values are:
     22.0 24.4 22.0 19.4 17.5
     19.8 22.0 19.8 17.8 16.0
     17.8 19.8 17.8 16.0 14.4
     16.0 17.8 16.0 14.4 13.0
     14.4 16.0 14.4 13.0 11.7

  38. Lecture 2: Markov Decision Processes. Solving the Bellman Optimality Equation. The Bellman optimality equation is non-linear and has no closed-form solution in general. There are many iterative solution methods: value iteration, policy iteration, Q-learning, and Sarsa.

  39. Planning by Dynamic Programming. Sutton & Barto, Chapter 4.

  40. Lecture 3: Planning by Dynamic Programming. Introduction. Dynamic programming assumes full knowledge of the MDP and is used for planning in an MDP. For prediction, the input is an MDP (S, A, P, R, γ) and a policy π, or an MRP (S, P^π, R^π, γ), and the output is the value function v_π. For control, the input is an MDP (S, A, P, R, γ) and the output is the optimal value function v_* and an optimal policy π_*.

  41. Lecture 3: Planning by Dynamic Programming. Policy Evaluation (Prediction). Problem: evaluate a given policy π. Solution: iterative application of the Bellman expectation backup, v_1 → v_2 → ... → v_π, using synchronous backups: at each iteration k + 1, for all states s ∈ S, update v_{k+1}(s) from v_k(s'), where s' is a successor state of s. We will discuss asynchronous backups later; convergence to v_π will be proven at the end of the lecture.

  42. Iterative Policy Evaluation. The Bellman expectation equations form a system of simultaneous linear equations in |S| unknowns and can be solved exactly, but in practice an iterative procedure, repeated until a fixed point is reached, is often more effective: iterative policy evaluation.

  43. Iterative Policy Evaluation.

  44. Lecture 3: Planning by Dynamic Programming. Evaluating a Random Policy in the Small Gridworld. Undiscounted episodic MDP (γ = 1); nonterminal states 1, ..., 14; one terminal state (shown twice as shaded squares); actions leading out of the grid leave the state unchanged; reward is -1 until the terminal state is reached; the agent follows the uniform random policy π(n | ·) = π(e | ·) = π(s | ·) = π(w | ·) = 0.25.

  45. Lecture 3: Planning by Dynamic Programming. Iterative Policy Evaluation in the Small Gridworld. The left column of the figure shows v_k for the random policy; the right column (arrows, not reproduced here) shows the greedy policy with respect to v_k.
     k = 0 (random policy):
       0.0  0.0  0.0  0.0
       0.0  0.0  0.0  0.0
       0.0  0.0  0.0  0.0
       0.0  0.0  0.0  0.0
     k = 1:
       0.0 -1.0 -1.0 -1.0
      -1.0 -1.0 -1.0 -1.0
      -1.0 -1.0 -1.0 -1.0
      -1.0 -1.0 -1.0  0.0
     k = 2:
       0.0 -1.7 -2.0 -2.0
      -1.7 -2.0 -2.0 -2.0
      -2.0 -2.0 -2.0 -1.7
      -2.0 -2.0 -1.7  0.0

  46. Lecture 3: Planning by Dynamic Programming. Iterative Policy Evaluation in the Small Gridworld (2).
     k = 3:
       0.0 -2.4 -2.9 -3.0
      -2.4 -2.9 -3.0 -2.9
      -2.9 -3.0 -2.9 -2.4
      -3.0 -2.9 -2.4  0.0
     k = 10 (the greedy policy w.r.t. v_10 is already optimal):
       0.0 -6.1 -8.4 -9.0
      -6.1 -7.7 -8.4 -8.4
      -8.4 -8.4 -7.7 -6.1
      -9.0 -8.4 -6.1  0.0
     k = ∞:
       0.0 -14. -20. -22.
      -14. -18. -20. -20.
      -20. -20. -18. -14.
      -22. -20. -14.  0.0
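The following sketch reproduces this iteration for the small gridworld described above (4x4 grid, terminal corners, reward -1 per step, uniform random policy, γ = 1); the details of the grid encoding are my own choices:

```python
import numpy as np

N = 4
terminal = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # north, south, west, east

def step(s, a):
    nxt = (s[0] + a[0], s[1] + a[1])
    if not (0 <= nxt[0] < N and 0 <= nxt[1] < N):
        nxt = s                                       # off-grid moves leave the state unchanged
    return nxt, -1.0                                  # reward -1 on every step

v = np.zeros((N, N))
for k in range(1000):
    v_new = np.zeros_like(v)
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue                              # terminal states keep value 0
            v_new[i, j] = sum(0.25 * (r + v[nxt])     # uniform random policy, gamma = 1
                              for nxt, r in (step((i, j), a) for a in actions))
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-4:
        break

print(v.round(0))   # converges to the k = infinity values shown above (0, -14, -20, -22, ...)
```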

  47. Lecture 3: Planning by Dynamic Programming. Policy Improvement. Given a policy π: evaluate the policy, v_π(s) = E[R_{t+1} + γ R_{t+2} + ... | S_t = s], then improve the policy by acting greedily with respect to v_π, π' = greedy(v_π). In the small gridworld the improved policy was already optimal, π' = π_*; in general more iterations of improvement and evaluation are needed, but this process of policy iteration always converges to π_*.

  48. Policy Iteration.

  49. Lecture 3: Planning by Dynamic Programming. Policy Iteration. Policy evaluation: estimate v_π (iterative policy evaluation). Policy improvement: generate π' ≥ π (greedy policy improvement).
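A compact sketch of this loop; `mdp` is an assumed interface exposing `states`, `actions(s)`, rewards `R(s, a)`, and transition probabilities `P(s, a)` as a dictionary over successor states:

```python
def policy_iteration(mdp, gamma, eval_sweeps=1000, tol=1e-6):
    pi = {s: mdp.actions(s)[0] for s in mdp.states}          # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for pi
        v = {s: 0.0 for s in mdp.states}
        for _ in range(eval_sweeps):
            delta = 0.0
            for s in mdp.states:
                new_v = mdp.R(s, pi[s]) + gamma * sum(p * v[s2]
                                                      for s2, p in mdp.P(s, pi[s]).items())
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to v
        stable = True
        for s in mdp.states:
            q = {a: mdp.R(s, a) + gamma * sum(p * v[s2] for s2, p in mdp.P(s, a).items())
                 for a in mdp.actions(s)}
            best = max(q, key=q.get)
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                                            # no state changed its action
            return pi, v
```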

  50. Lecture 3: Planning by Dynamic Programming. Policy Improvement.

  51. Lecture 3: Planning by Dynamic Programming. Policy Improvement (2). If improvements stop, q_π(s, π'(s)) = max_{a ∈ A} q_π(s, a) = q_π(s, π(s)) = v_π(s), then the Bellman optimality equation has been satisfied, v_π(s) = max_{a ∈ A} q_π(s, a). Therefore v_π(s) = v_*(s) for all s ∈ S, so π is an optimal policy.

  52. Lecture 3: Planning by Dynamic Programming. Modified Policy Iteration. Does policy evaluation need to converge to v_π? Or should we introduce a stopping condition, e.g. ε-convergence of the value function, or simply stop after k iterations of iterative policy evaluation? For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy. Why not update the policy every iteration, i.e. stop after k = 1? This is equivalent to value iteration (next section).

  53. Lecture 3: Planning by Dynamic Programming. Generalised Policy Iteration. Policy evaluation: estimate v_π (any policy evaluation algorithm). Policy improvement: generate π' ≥ π (any policy improvement algorithm).

  54. Lecture 3: Planning by Dynamic Programming. Principle of Optimality. Any optimal policy can be subdivided into two components: an optimal first action A_*, followed by an optimal policy from the successor state S'. Theorem (Principle of Optimality): A policy π(a | s) achieves the optimal value from state s, v_π(s) = v_*(s), if and only if, for any state s' reachable from s, π achieves the optimal value from state s', v_π(s') = v_*(s').

  55. Lecture 3: Planning by Dynamic Programming. Deterministic Value Iteration.

  56. Value Iteration.

  57. Value Iteration.

  58. Lecture 3: Planning by Dynamic Programming. Example: Shortest Path. [Figure: a 4x4 shortest-path gridworld with goal state g in the top-left corner, together with the value-iteration iterates V_1 through V_7. The values converge to minus the number of steps to the goal; e.g. V_7 has rows 0 -1 -2 -3, -1 -2 -3 -4, -2 -3 -4 -5, -3 -4 -5 -6.]

  59. Lecture 3: Planning by Dynamic Programming. Value Iteration. Problem: find the optimal policy π. Solution: iterative application of the Bellman optimality backup, v_1 → v_2 → ... → v_*, using synchronous backups: at each iteration k + 1, for all states s ∈ S, update v_{k+1}(s) from v_k(s'). Convergence to v_* will be proven later. Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy.
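A sketch of value iteration with synchronous backups, using the same assumed `mdp` interface as in the earlier sketches, followed by reading off a greedy policy from the converged values:

```python
def value_iteration(mdp, gamma, tol=1e-6):
    # Assumes every state has at least one action (terminal states can be
    # modelled with a zero-reward self-loop).
    v = {s: 0.0 for s in mdp.states}
    while True:
        # Synchronous Bellman optimality backup: v_{k+1}(s) = max_a [R + gamma * sum P v_k]
        v_new = {s: max(mdp.R(s, a) + gamma * sum(p * v[s2] for s2, p in mdp.P(s, a).items())
                        for a in mdp.actions(s))
                 for s in mdp.states}
        delta = max(abs(v_new[s] - v[s]) for s in mdp.states)
        v = v_new
        if delta < tol:
            break
    # Extract a deterministic greedy policy from the converged values
    pi = {s: max(mdp.actions(s),
                 key=lambda a: mdp.R(s, a) + gamma * sum(p * v[s2]
                                                         for s2, p in mdp.P(s, a).items()))
          for s in mdp.states}
    return v, pi
```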

  60. Lecture 3: Planning by Dynamic Programming. Value Iteration (2).

  61. Lecture 3: Planning by Dynamic Programming. Asynchronous Dynamic Programming. The DP methods described so far used synchronous backups, i.e. all states are backed up in parallel. Asynchronous DP backs up states individually, in any order: for each selected state, apply the appropriate backup. This can significantly reduce computation and is guaranteed to converge if all states continue to be selected.
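A sketch of a single asynchronous (in-place) sweep over the states, again against the assumed `mdp` interface; states are updated one at a time and the freshest values are reused immediately:

```python
def in_place_sweep(v, mdp, gamma, order=None):
    # Any ordering of the states is allowed; later updates in the same sweep
    # already see the values written earlier in the sweep.
    for s in (order or mdp.states):
        v[s] = max(mdp.R(s, a) + gamma * sum(p * v[s2] for s2, p in mdp.P(s, a).items())
                   for a in mdp.actions(s))
    return v
```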
