
Reinforcement learning
Applied artificial intelligence (EDA132), Lecture 13, 2012-04-26
Elin A. Topp
Material based on the course book, chapter 21 (with references to chapter 17), and on the lecture "Belöningsbaserad inlärning / Reinforcement learning" by Örjan Ekeberg.


1. A 4x3 world
• Fixed policy - passive learning.
• Always start in state (1,1).
• Do trials, observing states and rewards until a terminal state is reached, then update the utilities.
• Eventually, the agent learns how good the policy is - it can evaluate the policy and test different ones.
• The policy shown in the left grid is optimal for rewards of -0.04 in all reachable, nonterminal states and without discounting.

Policy (left grid; R = right, U = up, L = left; the blank square is a wall):

  R     R     R    +1
  U           U    -1
  U     L     L     L

Learned utilities (right grid):

  0.812  0.868  0.918    +1
  0.762         0.660    -1
  0.705  0.655  0.611  0.388
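For concreteness, here is one possible encoding of this 4x3 world in Python. The step reward of -0.04, the terminal rewards and the fixed policy follow the slide; the (column, row) coordinates and the dictionary representation are illustrative assumptions, not code from the lecture.

```python
# A minimal, assumed encoding of the 4x3 world from the slide.
# States are (column, row) pairs, 1-indexed; (2, 2) is the wall square.

WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # terminal states and their rewards
STEP_REWARD = -0.04                        # reward in every other reachable state

STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

# The fixed policy from the left grid: intended direction per nonterminal state.
POLICY = {
    (1, 3): "R", (2, 3): "R", (3, 3): "R",
    (1, 2): "U",              (3, 2): "U",
    (1, 1): "U", (2, 1): "L", (3, 1): "L", (4, 1): "L",
}

def reward(state):
    """Reward collected in `state` (terminal reward or the step cost)."""
    return TERMINALS.get(state, STEP_REWARD)
```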

2. Outline
• Reinforcement learning (chapter 21, with some references to 17)
  • Problem definition: learning situation, role of the reward, simplifying assumptions, central concepts and terms
• Known (observable) environment
  • Bellman's equation
  • Approaches to solutions
• Unknown environment
  • Monte Carlo method
  • Temporal Difference learning
  • Q-learning
  • Sarsa learning
• Improvements
  • The usefulness of making mistakes
  • Eligibility trace

3. Environment model
• Where do we get in each step? δ(s, a) ⟼ s'
• What will the reward be? r(s, a) ⟼ ℝ

Given a fixed policy π, the utility values of the states obey Bellman's equation:

  U^π(s) = r(s, π(s)) + γ · U^π(δ(s, π(s)))
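To fix the notation in code: δ and r can be thought of as two plain functions, and Bellman's equation says that the true utilities of a fixed policy are a fixed point of the corresponding backup. The corridor task below is an assumed toy example, not from the lecture.

```python
# A minimal sketch of the model functions and of Bellman's equation for a
# fixed policy, on an assumed corridor 0 -> 1 -> 2 -> 3 (state 3 terminal,
# entering it pays +1).

GAMMA = 0.9

def delta(s, a):
    """State transition function: delta(s, a) -> s'."""
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """Reward function: r(s, a) -> a real number."""
    return 1.0 if s != 3 and delta(s, a) == 3 else 0.0

def policy(s):
    """The fixed policy pi: always walk right."""
    return "right"

def bellman_rhs(U, s):
    """Right-hand side of Bellman's equation for the fixed policy."""
    a = policy(s)
    return reward(s, a) + GAMMA * U[delta(s, a)]

# The true utilities of this policy satisfy the equation exactly:
U_true = {0: GAMMA ** 2, 1: GAMMA, 2: 1.0, 3: 0.0}
print(all(abs(bellman_rhs(U_true, s) - U_true[s]) < 1e-9 for s in (0, 1, 2)))
```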

4. Solving the equation
There are two ways of solving Bellman's equation

  U^π(s) = r(s, π(s)) + γ · U^π(δ(s, π(s)))

• Directly:

  U^π(s) = r(s, π(s)) + γ · ∑_s' P(s' | s, π(s)) · U^π(s')
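To make the direct route concrete: for a fixed policy, Bellman's equation is a system of linear equations in the unknown utilities. The sketch below solves such a system with NumPy for a made-up three-state chain; the transition matrix, rewards and discount are assumptions, not lecture material.

```python
import numpy as np

# Assumed toy problem: three states, a fixed policy pi, its transition matrix
# P[s, s'] = P(s' | s, pi(s)), its rewards r[s] = r(s, pi(s)), and gamma = 0.9.
gamma = 0.9
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])       # state 2 is absorbing with zero reward
r = np.array([0.0, 1.0, 0.0])

# Bellman's equation reads U = r + gamma * P @ U, i.e. (I - gamma * P) U = r,
# which is a plain linear system in the unknown utilities.
U = np.linalg.solve(np.eye(3) - gamma * P, r)
print(U)                              # approx. [0.89, 1.0, 0.0]
```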

5. Recap: Random policy
State values under a random policy (4x4 grid, read row by row):

    0   -14   -20   -22
  -14   -18   -22   -20
  -20   -22   -18   -14
  -22   -20   -14     0

  U^π(s) = r(s, π(s)) + γ · ∑_s' P(s' | s, π(s)) · U^π(s')

6. Solving the equation
There are two ways of solving Bellman's equation

  U^π(s) = r(s, π(s)) + γ · U^π(δ(s, π(s)))

• Directly:

  U^π(s) = r(s, π(s)) + γ · ∑_s' P(s' | s, π(s)) · U^π(s')

• Iteratively (value / utility iteration), stopping when an equilibrium is reached, i.e., "nothing happens" any more:

  U^π_k+1(s) ⟵ r(s, π(s)) + γ · U^π_k(δ(s, π(s)))
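The iterative route can be sketched as repeated Bellman backups until the values stop changing. The toy corridor below and its rewards are assumptions for illustration, not the lecture's example.

```python
# Iterative evaluation of a fixed policy on an assumed corridor task:
# repeat the Bellman backup until "nothing happens" any more.

GAMMA = 0.9
STATES = [0, 1, 2, 3]                # state 3 is terminal

def delta(s, a):                     # deterministic transitions: walk right
    return min(s + 1, 3)

def reward(s, a):                    # the step from state 2 into the terminal pays +1
    return 1.0 if s == 2 else 0.0

def policy(s):
    return "right"

U = {s: 0.0 for s in STATES}
while True:
    largest_change = 0.0
    for s in STATES[:-1]:                        # skip the terminal state
        a = policy(s)
        new = reward(s, a) + GAMMA * U[delta(s, a)]
        largest_change = max(largest_change, abs(new - U[s]))
        U[s] = new
    if largest_change < 1e-6:                    # equilibrium reached
        break
print(U)                                         # approx. {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```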

7. Bayesian reinforcement learning
A remark: one form of reinforcement learning integrates Bayesian learning into the process to obtain the transition model, i.e., P(s' | s, π(s)). This means assuming a prior probability for each hypothesis about what the model might look like, and then applying Bayes' rule to obtain the posterior. We will not go into the details here.
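As a loosely sketched illustration of that remark only (the details are beyond this lecture), one simple choice is a Dirichlet-style count model whose posterior mean serves as the estimated transition probabilities; the state names and the uniform prior below are assumptions.

```python
from collections import defaultdict

# Hedged sketch: keep pseudo-counts for observed transitions and use the
# posterior mean as the estimated transition model P(s' | s, a).

STATES = ["s1", "s2", "s3"]
PRIOR_COUNT = 1.0                      # assumed uniform Dirichlet prior over successors

counts = defaultdict(lambda: {s: PRIOR_COUNT for s in STATES})

def observe(s, a, s_next):
    """Record one observed transition (s, a) -> s_next."""
    counts[(s, a)][s_next] += 1.0

def transition_probability(s, a, s_next):
    """Posterior-mean estimate of P(s' | s, a)."""
    successor_counts = counts[(s, a)]
    return successor_counts[s_next] / sum(successor_counts.values())

observe("s1", "right", "s2")
observe("s1", "right", "s2")
print(round(transition_probability("s1", "right", "s2"), 3))   # (1 + 2) / 5 = 0.6
```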

8. Finding optimal policy and value function
How can we find an optimal policy π*? That would be easy if we had the optimal value / utility function U*:

  π*(s) = argmax_a ( r(s, a) + γ · U*(δ(s, a)) )

Applying this to Bellman's equation gives the "optimal version":

  U*(s) = max_a ( r(s, a) + γ · U*(δ(s, a)) )

Tricky to solve, but possible: combine policy and value iteration by switching between them in each iteration step.
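If U* were known (and δ and r were available), the optimal policy could be read off directly with the argmax above. The sketch below does exactly that for an assumed deterministic corridor task; the dynamics, rewards and the U* values are illustrative assumptions, not lecture material.

```python
# Reading off pi*(s) = argmax_a ( r(s, a) + gamma * U*(delta(s, a)) )
# once the optimal utilities U* are known (assumed toy numbers).

GAMMA = 0.9
ACTIONS = ["left", "right"]

def delta(s, a):                       # corridor 0..3, state 3 terminal
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def reward(s, a):
    return 1.0 if s == 2 and a == "right" else 0.0

U_star = {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}   # assumed optimal utilities

def optimal_action(s):
    return max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * U_star[delta(s, a)])

print({s: optimal_action(s) for s in (0, 1, 2)})   # "right" everywhere
```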

9. Policy iteration
Policy iteration provides exactly this switch. For each iteration step k:

  π_k(s)   = argmax_a ( r(s, a) + γ · U_k(δ(s, a)) )
  U_k+1(s) = r(s, π_k(s)) + γ · U_k(δ(s, π_k(s)))
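A compact sketch of this switching scheme, again on an assumed deterministic corridor task (all numbers are illustrative assumptions): each pass first improves the policy from the current utilities, then backs the utilities up under that policy.

```python
# Policy iteration sketch on an assumed corridor 0..3 (state 3 terminal).

GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]

def delta(s, a):
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def reward(s, a):
    return 1.0 if s == 2 and a == "right" else 0.0

U = {s: 0.0 for s in STATES}
pi = {s: "left" for s in STATES[:-1]}

for _ in range(50):                    # a fixed number of sweeps is enough here
    # Policy improvement: pi_k(s) = argmax_a ( r(s, a) + gamma * U_k(delta(s, a)) )
    for s in STATES[:-1]:
        pi[s] = max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * U[delta(s, a)])
    # Value update: U_{k+1}(s) = r(s, pi_k(s)) + gamma * U_k(delta(s, pi_k(s)))
    U = {s: (reward(s, pi[s]) + GAMMA * U[delta(s, pi[s])]) if s != 3 else 0.0
         for s in STATES}

print(pi)   # converges to "right" in every state
print(U)    # approx. {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```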

10. Outline (revisited)
We now move from the known environment to the unknown environment:
• Monte Carlo method
• Temporal Difference learning
• Q-learning
• Sarsa learning
and then to improvements: the usefulness of making mistakes, and the eligibility trace.

11. Monte Carlo approach
Usually the reward r(s, a) and the state transition function δ(s, a) are unknown to the learning agent. (What does that mean for learning to ride a bike?)

Still, we can estimate U* from experience, as a Monte Carlo approach does:
• Start from a randomly chosen state s.
• Follow a policy π, storing the rewards and the visited states s_t for each time step t.
• When the goal is reached, update the estimate U^π(s_t) for all visited states s_t with the future reward that was obtained when reaching the goal.
• Start over from a randomly chosen s ...

Converges slowly.
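As an illustration of the procedure above, the sketch below estimates state values purely from sampled episodes of an assumed 1-dimensional random-walk task; the task, start states and episode count are assumptions, not the lecture's example.

```python
import random
from collections import defaultdict

# Monte Carlo value estimation from experience: states 0..4, both ends
# terminal, and only reaching state 4 pays a reward of +1 at the end.

returns = defaultdict(list)                 # observed future rewards per state

def run_episode():
    s = random.choice([1, 2, 3])            # start in a randomly chosen state
    visited = []
    while s not in (0, 4):                  # follow the (here: random) policy
        visited.append(s)
        s += random.choice((-1, 1))
    final_reward = 1.0 if s == 4 else 0.0
    # Update the estimate of every visited state with the future reward that
    # was obtained when the terminal state was reached.
    for s_t in visited:
        returns[s_t].append(final_reward)

random.seed(0)
for _ in range(5000):
    run_episode()

U = {s: round(sum(rs) / len(rs), 2) for s, rs in sorted(returns.items())}
print(U)                                    # roughly {1: 0.25, 2: 0.5, 3: 0.75}
```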

12. Temporal Difference learning
Temporal Difference learning uses the fact that there are two estimates for the value of a state: one before and one after visiting it. In other words, what the agent believes
• before acting:  U^π(s_t)
• after acting:   r_t+1 + γ · U^π(s_t+1)

13. Applying the estimates
The second estimate in the Temporal Difference learning approach is obviously "better", so we update the overall approximation of a state's value towards this more accurate estimate:

  U^π(s_t) ⟵ U^π(s_t) + α [ r_t+1 + γ · U^π(s_t+1) - U^π(s_t) ]

The term in brackets is a measure of the "surprise" or "disappointment" about the outcome of an action. This converges significantly faster than the pure Monte Carlo approach.
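The same random-walk illustration, now learned with the TD update above: instead of waiting for the end of an episode, the estimate is adjusted after every single step (again, the task and the step size α are assumptions for illustration).

```python
import random
from collections import defaultdict

# TD(0) sketch: after every step, U(s_t) is nudged towards the better
# estimate r_{t+1} + gamma * U(s_{t+1}).

ALPHA, GAMMA = 0.1, 1.0
U = defaultdict(float)                      # unseen states start at 0

random.seed(0)
for _ in range(5000):
    s = random.choice([1, 2, 3])
    while s not in (0, 4):                  # 0 and 4 are terminal
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 4 else 0.0
        td_error = r + GAMMA * U[s_next] - U[s]   # "surprise" / "disappointment"
        U[s] += ALPHA * td_error
        s = s_next

print({s: round(U[s], 2) for s in (1, 2, 3)})     # roughly 0.25, 0.5, 0.75
```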

14. Q-learning
Problem: even if U is estimated well, the agent cannot compute π from it, since it has no knowledge of δ and r; it would have to learn those as well.

Solution (trick): estimate Q(s, a) instead of U(s), where Q(s, a) is the expected total reward when choosing action a in state s. Then

  π(s)  = argmax_a Q(s, a)
  U*(s) = max_a Q*(s, a)
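A tiny sketch of how a policy and the state utilities are read off a Q-table, using assumed, made-up Q-values:

```python
# pi(s)  = argmax_a Q(s, a)
# U*(s)  = max_a    Q*(s, a)

Q = {
    ("s1", "left"): 0.2, ("s1", "right"): 0.7,
    ("s2", "left"): 0.5, ("s2", "right"): 0.4,
}
ACTIONS = ["left", "right"]

def greedy_policy(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def utility(s):
    return max(Q[(s, a)] for a in ACTIONS)

print(greedy_policy("s1"), utility("s1"))   # right 0.7
```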

15. Learning Q
How can we learn Q? The Q-function, too, can be learned with the Temporal Difference approach:

  Q(s, a) ⟵ Q(s, a) + α [ r + γ · max_a' Q(s', a') - Q(s, a) ]

where s' is the state reached after choosing action a in s, and a' ranges over the actions available in s'.
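Putting the pieces together, the sketch below runs this Q-learning update with ε-greedy action selection on an assumed corridor task; the environment, learning rate, discount and exploration rate are all illustrative assumptions, not the lecture's example.

```python
import random
from collections import defaultdict

# Q-learning sketch on an assumed 5-state corridor. The learner never looks
# inside step(); it only uses the transitions and rewards it experiences.

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["left", "right"]
TERMINAL = 4
Q = defaultdict(float)                        # Q(s, a), 0.0 by default

def step(s, a):
    """The environment: hidden dynamics delta and reward r."""
    s_next = min(s + 1, TERMINAL) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def choose_action(s):
    """Epsilon-greedy: mostly argmax_a Q(s, a), sometimes a random 'mistake'."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

random.seed(0)
for _ in range(500):
    s = random.randint(0, TERMINAL - 1)
    while s != TERMINAL:
        a = choose_action(s)
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])   # TD-style update
        s = s_next

# The greedy policy read off the learned Q-table should point "right" everywhere.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)})
```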
