  1. Module 11: Introduction to Reinforcement Learning
     CS 886: Sequential Decision Making and Reinforcement Learning
     University of Waterloo
     (c) 2013 Pascal Poupart

  2. Machine Learning
  • Supervised Learning
    – Teacher tells learner what to remember
  • Reinforcement Learning
    – Environment provides hints to learner
  • Unsupervised Learning
    – Learner discovers on its own

  3. Animal Psychology
  • Negative reinforcements:
    – Pain and hunger
  • Positive reinforcements:
    – Pleasure and food
  • Reinforcements used to train animals
  • Let's do the same with computers!

  4. RL Examples
  • Game playing (backgammon, solitaire)
  • Operations research (pricing, vehicle routing)
  • Elevator scheduling
  • Helicopter control
  • Spoken dialog systems

  5. Reinforcement Learning
  • Definition:
    – Markov decision process with unknown transition and reward models
  • Set of states S
  • Set of actions A
    – Actions may be stochastic
  • Set of reinforcement signals (rewards)
    – Rewards may be delayed

  6. Policy optimization
  • Markov decision process:
    – Find the optimal policy given the transition and reward model
    – Execute the policy found
  • Reinforcement learning:
    – Learn an optimal policy while interacting with the environment

  7. Reinforcement Learning Problem
  [Diagram: agent-environment loop. The agent observes state $s_t$, sends action $a_t$ to the environment, and receives reward $r_t$, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
  Goal: Learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \cdots$

  8. Example: Inverted Pendulum
  • State: $x(t), x'(t), \theta(t), \theta'(t)$
  • Action: force $F$
  • Reward: 1 for any step where the pole is balanced
  Problem: Find $\pi: S \rightarrow A$ that maximizes rewards

  9. Types of RL
  • Passive vs active learning
    – Passive learning: the agent executes a fixed policy and tries to evaluate it
    – Active learning: the agent updates its policy as it learns
  • Model-based vs model-free
    – Model-based: learn the transition and reward model and use it to determine the optimal policy
    – Model-free: derive the optimal policy without learning the model

  10. Passive Learning
  • Transition and reward model known:
    – Evaluate $\pi$:  $V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s,\pi(s))\, V^\pi(s') \quad \forall s$
  • Transition and reward model unknown:
    – Estimate the value of the policy as the agent executes it:  $V^\pi(s) = E_\pi\!\left[\sum_t \gamma^t R(s_t, \pi(s_t))\right]$
    – Model-based vs model-free
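
When the model is known, the policy-evaluation equation on this slide is a linear system in $V^\pi$, so it can be solved directly. A minimal numpy sketch, assuming the policy's model is given as a transition matrix `P_pi` and reward vector `R_pi` (illustrative names and numbers, not from the slides):

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma):
    """Solve V = R_pi + gamma * P_pi @ V exactly.

    P_pi[s, s'] = Pr(s' | s, pi(s)),  R_pi[s] = R(s, pi(s)).
    """
    n = len(R_pi)
    # (I - gamma * P_pi) V = R_pi  =>  V = (I - gamma * P_pi)^{-1} R_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Tiny 2-state example with made-up numbers, just to exercise the solver.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])
print(evaluate_policy(P_pi, R_pi, gamma=0.9))
```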

  11. Passive learning
  [Figure: 3×4 grid world with terminal rewards +1 at (4,3) and -1 at (4,2); arrows show the fixed policy (top row: r r r, middle row: u and u, bottom row: u l l l).]
  • $\gamma = 1$; reward is -0.04 for non-terminal states
  • The transition probabilities are not known
  • Observed episodes:
    (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
    (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
    (1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1
  • What is the value $V(s)$ of being in state $s$?
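
One way to answer the question on this slide is to average the discounted returns observed after each visit to a state (the direct, Monte-Carlo-style estimate that later slides contrast with ADP and TD). A rough sketch over the three episodes above, assuming $\gamma = 1$, a -0.04 reward on every non-terminal step, and the ±1 received on entering the terminal state; the exact reward accounting is an assumption, not spelled out on the slide:

```python
from collections import defaultdict

gamma = 1.0
step_reward = -0.04  # reward for non-terminal states (from the slide)

# The three observed episodes; the final entry of each is the terminal
# state, paired with its +1 / -1 reward (this pairing is an assumption).
episodes = [
    ([(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)], +1),
    ([(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)], +1),
    ([(1,1),(2,1),(3,1),(3,2),(4,2)], -1),
]

returns = defaultdict(list)
for states, terminal_reward in episodes:
    # One reward per transition; the last transition also yields +1 / -1.
    rewards = [step_reward] * (len(states) - 2) + [step_reward + terminal_reward]
    for i, s in enumerate(states[:-1]):          # every-visit estimate
        g = sum(gamma**k * r for k, r in enumerate(rewards[i:]))
        returns[s].append(g)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V[(1, 1)])   # average return observed from the start state
```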

  12. Passive ADP
  • Adaptive dynamic programming (ADP)
    – Model-based
    – Learn transition probabilities and rewards from observations
    – Then update the values of the states

  13. ADP Example
  • Same 3×4 grid world: $\gamma = 1$, reward -0.04 for non-terminal states, terminals +1 at (4,3) and -1 at (4,2)
  • We need to learn all the transition probabilities!
  • Observed episodes:
    (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
    (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
    (1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1
  • Estimates from the episodes:  $\Pr((2,3) \mid (1,3), r) = 2/3$,  $\Pr((1,2) \mid (1,3), r) = 1/3$
  • Use this information in  $V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s,\pi(s))\, V^\pi(s')$
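
The probabilities on this slide are just empirical frequencies over the observed transitions. A quick sketch that recovers Pr((2,3)|(1,3),r) = 2/3 and Pr((1,2)|(1,3),r) = 1/3 by counting; the episode lists come from the slide, and "r" (right) is the fixed policy's action in (1,3):

```python
from collections import Counter, defaultdict

episodes = [
    [(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)],
    [(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)],
    [(1,1),(2,1),(3,1),(3,2),(4,2)],
]

counts = defaultdict(Counter)            # counts[s][s'] = n(s, s')
for ep in episodes:
    for s, s_next in zip(ep, ep[1:]):
        counts[s][s_next] += 1

s = (1, 3)                               # the fixed policy takes 'right' here
n_s = sum(counts[s].values())
for s_next, n in counts[s].items():
    print(s_next, n / n_s)               # (1,2): 1/3  and  (2,3): 2/3
```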

  14. Passive ADP
  PassiveADP(π)
    Repeat
      Execute π(s); observe s' and r
      Update counts:  n(s) ← n(s) + 1,  n(s,s') ← n(s,s') + 1
      Update transition:  Pr(s'|s,π(s)) ← n(s,s') / n(s)  ∀ s'
      Update reward:  R(s,π(s)) ← [r + (n(s) - 1) R(s,π(s))] / n(s)
      Solve:  V^π(s) = R(s,π(s)) + γ Σ_{s'} Pr(s'|s,π(s)) V^π(s')  ∀ s
      s ← s'
    Until convergence of V^π
    Return V^π
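
A compact Python sketch of the PassiveADP loop above. The `env` object with `reset()` and `step(a)` and the dict-based `policy` are assumed interfaces for illustration, and the "Solve" step is approximated by repeated value sweeps over the learned model rather than an exact linear solve:

```python
from collections import defaultdict

def passive_adp(env, policy, gamma=0.9, episodes=500, sweeps=50):
    """Model-based passive RL: learn Pr(s'|s,pi(s)) and R(s,pi(s)) from
    experience, then evaluate the fixed policy on the learned model."""
    n_s = defaultdict(int)                          # n(s)
    n_ss = defaultdict(lambda: defaultdict(int))    # n(s, s')
    R = defaultdict(float)                          # learned R(s, pi(s))
    V = defaultdict(float)                          # estimate of V^pi

    for _ in range(episodes):
        s, done = env.reset(), False                # assumed env interface
        while not done:
            s_next, r, done = env.step(policy[s])
            n_s[s] += 1
            n_ss[s][s_next] += 1
            R[s] += (r - R[s]) / n_s[s]             # running average of reward
            s = s_next

        # evaluate pi on the learned model with a few value sweeps
        for _ in range(sweeps):
            for s0 in n_s:
                V[s0] = R[s0] + gamma * sum(
                    (n / n_s[s0]) * V[s1] for s1, n in n_ss[s0].items())
    return V
```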

  15. Passive TD
  • Temporal difference (TD)
    – Model-free
  • At each time step
    – Observe: s, a, s', r
    – Update $V^\pi(s)$ after each move:
      $V^\pi(s) \leftarrow V^\pi(s) + \alpha\,\big(R(s,\pi(s)) + \gamma V^\pi(s') - V^\pi(s)\big)$
      where $\alpha$ is the learning rate and the term in parentheses is the temporal difference

  16. TD Convergence
  Theorem: If $\alpha$ is appropriately decreased with the number of times a state is visited, then $V^\pi(s)$ converges to the correct value.
  • $\alpha$ must satisfy:
    – $\sum_t \alpha_t \to \infty$
    – $\sum_t \alpha_t^2 < \infty$
  • Often $\alpha(s) = 1/n(s)$, where $n(s)$ = number of times $s$ is visited

  17. Passive TD
  PassiveTD(π, V^π)
    Repeat
      Execute π(s); observe s' and r
      Update counts:  n(s) ← n(s) + 1
      Learning rate:  α ← 1/n(s)
      Update value:  V^π(s) ← V^π(s) + α ( r + γ V^π(s') − V^π(s) )
      s ← s'
    Until convergence of V^π
    Return V^π
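
The same setting with the model-free update above; a sketch under the same assumed `env` interface, using the 1/n(s) learning-rate schedule from the convergence slide (treating the value of a terminal successor as 0 is an added assumption):

```python
from collections import defaultdict

def passive_td(env, policy, gamma=0.9, episodes=5000):
    """Model-free policy evaluation: V(s) <- V(s) + alpha (r + gamma V(s') - V(s))."""
    V = defaultdict(float)
    n = defaultdict(int)

    for _ in range(episodes):
        s, done = env.reset(), False                # assumed env interface
        while not done:
            s_next, r, done = env.step(policy[s])
            n[s] += 1
            alpha = 1.0 / n[s]                      # decaying learning rate
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])         # temporal-difference update
            s = s_next
    return V
```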

  18. Comparison
  • Model-free approach:
    – Less computation per time step
  • Model-based approach:
    – Fewer time steps before convergence

  19. Active Learning
  • Ultimately, we are interested in improving $\pi$
  • Transition and reward model known:
    $V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a)\, V^*(s') \right]$
  • Transition and reward model unknown:
    – Improve the policy as the agent executes it
    – Model-based vs model-free
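
For the known-model case, the Bellman optimality equation above is typically solved by value iteration. A minimal sketch, assuming the model is supplied as `P[s][a]` (a list of (s', probability) pairs) and `R[s][a]`, both illustrative names:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            # greedy policy with respect to the converged values
            pi = {s: max(actions,
                         key=lambda a: R[s][a] + gamma *
                         sum(p * V[s2] for s2, p in P[s][a]))
                  for s in states}
            return V, pi
```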

  20. Q-learning (aka active temporal difference)
  • Q-function: $Q: S \times A \to \Re$
    – Value of a state-action pair
    – The policy $\pi(s) = \mathrm{argmax}_a\, Q(s,a)$ is the greedy policy w.r.t. $Q$
  • Bellman's equation:
    $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a) \max_{a'} Q^*(s',a')$

  21. Q-Learning
  Qlearning(s, Q*)
    Repeat
      Select and execute a; observe s' and r
      Update counts:  n(s) ← n(s) + 1
      Learning rate:  α ← 1/n(s)
      Update Q-value:  Q*(s,a) ← Q*(s,a) + α ( r + γ max_{a'} Q*(s',a') − Q*(s,a) )
      s ← s'
    Until convergence of Q*
    Return Q*
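
A Python sketch of the Qlearning loop above, again under an assumed `env` interface with `reset()`/`step(a)` and a finite action list; action selection is left purely random here as a placeholder, since exploration strategies are discussed two slides later:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, episodes=10000):
    Q = defaultdict(float)                          # Q[(s, a)]
    n = defaultdict(int)                            # visit counts for the learning rate

    for _ in range(episodes):
        s, done = env.reset(), False                # assumed env interface
        while not done:
            a = random.choice(actions)              # placeholder exploration
            s_next, r, done = env.step(a)
            n[s] += 1
            alpha = 1.0 / n[s]                      # 1/n(s) schedule, as on the slide
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```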

  22. Q-learning example
  [Figure: grid snapshots before and after taking action "right" from state $s_1$ into $s_2$; the arrows out of $s_2$ carry Q-values 66, 81, and 100, and $Q(s_1, \text{right})$ is updated from 73 to 81.5.]
  • $\gamma = 0.9$, $\alpha = 0.5$, $r = 0$ for non-terminal states
  $Q(s_1,\text{right}) \leftarrow Q(s_1,\text{right}) + \alpha\,\big(r + \gamma \max_{a'} Q(s_2,a') - Q(s_1,\text{right})\big)$
  $= 73 + 0.5\,(0 + 0.9 \max\{66, 81, 100\} - 73) = 73 + 0.5\,(17) = 81.5$

  23. Q-Learning
  Qlearning(s, Q*)
    Repeat
      Select and execute a; observe s' and r
      Update counts:  n(s) ← n(s) + 1
      Learning rate:  α ← 1/n(s)
      Update Q-value:  Q*(s,a) ← Q*(s,a) + α ( r + γ max_{a'} Q*(s',a') − Q*(s,a) )
      s ← s'
    Until convergence of Q*
    Return Q*

  24. Exploration vs Exploitation
  • If an agent always chooses the action with the highest value, then it is exploiting
    – The learned model is not the real model
    – Leads to suboptimal results
  • By taking random actions (pure exploration) an agent may learn the model
    – But what is the use of learning a complete model if parts of it are never used?
  • Need a balance between exploitation and exploration

  25. Common exploration methods
  • ε-greedy:
    – With probability $\epsilon$ execute a random action
    – Otherwise execute the best action $a^* = \mathrm{argmax}_a\, Q(s,a)$
  • Boltzmann exploration:
    $\Pr(a) = \dfrac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$
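
The two rules on this slide as small Python helpers; `Q` is assumed to be a mapping from (state, action) pairs to values and `actions` a finite list (illustrative assumptions, not from the slides):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, T=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```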

  26. Exploration and Q-learning
  • Q-learning converges to the optimal Q-values if:
    – Every state is visited infinitely often (due to exploration)
    – The action selection becomes greedy as time approaches infinity
    – The learning rate $\alpha$ is decreased fast enough, but not too fast

  27. Model-based Active RL
  • Idea: at each step
    – Execute an action
    – Observe the resulting state and reward
    – Update the model
    – Update the policy $\pi$
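
One way to instantiate this loop (a sketch, not the slide's prescribed algorithm): keep the count-based model from Passive ADP, recompute values on the learned model after each episode, and act ε-greedily with respect to them. The `env` interface and all names are illustrative:

```python
import random
from collections import defaultdict

def model_based_active_rl(env, actions, gamma=0.9, eps=0.1, episodes=500, sweeps=30):
    """Learn a model from experience and act greedily w.r.t. values computed on it."""
    n_sa = defaultdict(int)                          # n(s, a)
    n_sas = defaultdict(lambda: defaultdict(int))    # n(s, a, s')
    R = defaultdict(float)                           # learned R(s, a)
    V = defaultdict(float)

    def q(s, a):
        # untried actions default to 0 in this sketch
        if n_sa[(s, a)] == 0:
            return 0.0
        return R[(s, a)] + gamma * sum(
            c / n_sa[(s, a)] * V[s2] for s2, c in n_sas[(s, a)].items())

    for _ in range(episodes):
        s, done = env.reset(), False                 # assumed env interface
        while not done:
            # epsilon-greedy action w.r.t. the current learned model
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a2: q(s, a2)))
            s_next, r, done = env.step(a)
            n_sa[(s, a)] += 1                        # update the model
            n_sas[(s, a)][s_next] += 1
            R[(s, a)] += (r - R[(s, a)]) / n_sa[(s, a)]
            s = s_next
        states = {sa[0] for sa in n_sa}              # update values (and hence the policy)
        for _ in range(sweeps):
            for s0 in states:
                V[s0] = max(q(s0, a0) for a0 in actions)
    return V
```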
