Module 11: Introduction to Reinforcement Learning
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Machine Learning
• Supervised learning: a teacher tells the learner what to remember
• Reinforcement learning: the environment provides hints to the learner
• Unsupervised learning: the learner discovers on its own
Animal Psychology
• Negative reinforcements: pain and hunger
• Positive reinforcements: pleasure and food
• Reinforcements are used to train animals
• Let's do the same with computers!
RL Examples
• Game playing (backgammon, solitaire)
• Operations research (pricing, vehicle routing)
• Elevator scheduling
• Helicopter control
• Spoken dialog systems
Reinforcement Learning
• Definition: a Markov decision process with unknown transition and reward models
• Set of states S
• Set of actions A
  – Actions may be stochastic
• Set of reinforcement signals (rewards)
  – Rewards may be delayed
Policy Optimization
• Markov decision process:
  – Find the optimal policy given the transition and reward models
  – Execute the policy found
• Reinforcement learning:
  – Learn an optimal policy while interacting with the environment
Reinforcement Learning Problem
• (Figure: the agent sends an action to the environment and receives back a state and a reward, generating the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, …)
• Goal: learn to choose actions that maximize the discounted return
  r0 + γ r1 + γ² r2 + γ³ r3 + ⋯
  (a small code illustration follows this slide)
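Not part of the original slides: a minimal Python sketch of the discounted return the agent is asked to maximize.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reward 1 with gamma = 0.9
print(discounted_return([1, 1, 1], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```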
Example: Inverted Pendulum
• State: x(t), x′(t), θ(t), θ′(t)
• Action: force F
• Reward: 1 for any step where the pole is balanced
• Problem: find a policy π: S → A that maximizes the rewards
Types of RL
• Passive vs active learning
  – Passive learning: the agent executes a fixed policy and tries to evaluate it
  – Active learning: the agent updates its policy as it learns
• Model-based vs model-free
  – Model-based: learn the transition and reward models and use them to determine the optimal policy
  – Model-free: derive the optimal policy without learning a model
Passive Learning
• Transition and reward models known:
  – Evaluate π:  V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s'|s, π(s)) V^π(s')   ∀s
    (sketched in code after this slide)
• Transition and reward models unknown:
  – Estimate the value of the policy as the agent executes it:
    V^π(s) = E_π[ Σ_t γ^t R(s_t, π(s_t)) ]
  – Model-based vs model-free
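Not from the slides: a minimal sketch of the known-model case, solving the linear system V = R + γ P V for a fixed policy; the two-state chain used here is a made-up toy example.

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma):
    """Solve V = R + gamma * P * V for a fixed policy.
    P_pi[s, s'] = Pr(s'|s, pi(s)), R_pi[s] = R(s, pi(s))."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Toy 2-state chain: each state stays put with prob 0.9, moves with prob 0.1
P_pi = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
R_pi = np.array([0.0, 1.0])
print(evaluate_policy(P_pi, R_pi, 0.95))
```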
Passive Learning Example
• (Figure: the 4×3 grid world with terminal rewards +1 at (4,3) and −1 at (4,2); the fixed policy is shown as arrows, e.g. right along the top row and up in the first column.)
• The reward is −0.04 in non-terminal states and γ = 1
• The agent does not know the transition probabilities
• Observed trajectories under the fixed policy:
  – (1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3) +1
  – (1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3) +1
  – (1,1) (2,1) (3,1) (3,2) (4,2) −1
• What is the value V(s) of being in state s?
Passive ADP
• Adaptive dynamic programming (ADP)
  – Model-based
  – Learn the transition probabilities and rewards from observations
  – Then update the values of the states
ADP Example
• Same grid world and trajectories as before; we need to learn all the transition probabilities!
• For example, in the trajectories the action right taken in state (1,3) led to (2,3) twice and to (1,2) once, so:
  – P((2,3) | (1,3), right) = 2/3
  – P((1,2) | (1,3), right) = 1/3
• Use this information in V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s'|s, π(s)) V^π(s')
Passive ADP

PassiveADP(π)
  Repeat
    Execute π(s); observe s' and r
    Update counts: n(s) ← n(s) + 1,  n(s, s') ← n(s, s') + 1
    Update transition model: Pr(s' | s, π(s)) ← n(s, s') / n(s)   ∀s'
    Update reward model: R(s, π(s)) ← ( r + (n(s) − 1) · R(s, π(s)) ) / n(s)
    Solve: V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s' | s, π(s)) V^π(s')   ∀s
    s ← s'
  Until convergence of V^π
  Return V^π
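A minimal Python sketch along the lines of the pseudocode above. The tabular `env` interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the episode loop are assumptions for illustration, and for simplicity the values are solved once at the end rather than after every step.

```python
import numpy as np
from collections import defaultdict

def passive_adp(env, policy, gamma, n_episodes=500):
    """Passive ADP sketch: estimate the model from experience, then solve for V^pi."""
    n_sa = defaultdict(int)       # visits to (s, a)
    n_sas = defaultdict(int)      # observed transitions (s, a, s')
    r_sum = defaultdict(float)    # accumulated rewards for (s, a)
    states = set()

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy[s]
            s2, r, done = env.step(a)
            n_sa[(s, a)] += 1
            n_sas[(s, a, s2)] += 1
            r_sum[(s, a)] += r
            states.update([s, s2])
            s = s2

    # Build R^pi and P^pi for the fixed policy, then solve V = R + gamma * P * V
    idx = {s: i for i, s in enumerate(sorted(states))}
    P = np.zeros((len(idx), len(idx)))
    R = np.zeros(len(idx))
    for s, i in idx.items():
        a = policy.get(s)
        if a is None or n_sa[(s, a)] == 0:   # e.g. terminal states: leave at 0
            continue
        R[i] = r_sum[(s, a)] / n_sa[(s, a)]
        for s2, j in idx.items():
            P[i, j] = n_sas[(s, a, s2)] / n_sa[(s, a)]
    V = np.linalg.solve(np.eye(len(idx)) - gamma * P, R)
    return {s: V[i] for s, i in idx.items()}
```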
Passive TD
• Temporal difference (TD): model-free
• At each time step:
  – Observe s, a, s', r
  – Update V^π(s) after each move:
    V^π(s) ← V^π(s) + α ( R(s, π(s)) + γ V^π(s') − V^π(s) )
    where α is the learning rate and the bracketed term is the temporal difference
TD Convergence
• Theorem: if α is decreased appropriately with the number of times a state is visited, then V^π(s) converges to the correct value
• α must satisfy:
  – Σ_t α_t = ∞
  – Σ_t α_t² < ∞
• Often α(s) = 1/n(s), where n(s) = number of times s has been visited
Passive TD

PassiveTD(π, V^π)
  Repeat
    Execute π(s); observe s' and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update value: V^π(s) ← V^π(s) + α ( r + γ V^π(s') − V^π(s) )
    s ← s'
  Until convergence of V^π
  Return V^π
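A minimal Python sketch of the same procedure, under the same assumed `env` interface as the ADP sketch.

```python
from collections import defaultdict

def passive_td(env, policy, gamma, n_episodes=5000):
    """Passive TD sketch: evaluate a fixed policy with the TD(0) update."""
    V = defaultdict(float)        # V^pi(s), initialized to 0
    n = defaultdict(int)          # visit counts, used for the learning rate

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy[s])
            n[s] += 1
            alpha = 1.0 / n[s]                       # decreasing learning rate
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return dict(V)
```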
Comparison
• Model-free approach: less computation per time step
• Model-based approach: fewer time steps before convergence
Active Learning
• Ultimately, we are interested in improving π
• Transition and reward models known:
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} Pr(s'|s, a) V*(s') ]
  (a value-iteration sketch follows this slide)
• Transition and reward models unknown:
  – Improve the policy as the agent executes it
  – Model-based vs model-free
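Not from the slides: a compact value-iteration sketch for the known-model case, assuming the model is given as arrays `P[a][s, s']` and `R[s, a]`.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration sketch: iterate the Bellman optimality backup to convergence."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a][s, s'] * V[s']
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new
```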
Q-Learning (a.k.a. Active Temporal Difference)
• Q-function: Q: S × A → ℝ
  – Value of a state-action pair
  – The policy π(s) = argmax_a Q(s, a) is the greedy policy w.r.t. Q
• Bellman's equation:
  Q*(s, a) = R(s, a) + γ Σ_{s'} Pr(s'|s, a) max_{a'} Q*(s', a')
Q-Learning

Qlearning(s, Q*)
  Repeat
    Select and execute a; observe s' and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update Q-value: Q*(s, a) ← Q*(s, a) + α ( r + γ max_{a'} Q*(s', a') − Q*(s, a) )
    s ← s'
  Until convergence of Q*
  Return Q*
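A minimal Python sketch of tabular Q-learning under the same assumed `env` interface. Action selection here uses ε-greedy exploration (introduced on a later slide), and the learning rate is based on state-action counts rather than the per-state counts of the pseudocode; both choices are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma, epsilon=0.1, n_episodes=5000):
    """Tabular Q-learning sketch with epsilon-greedy action selection."""
    Q = defaultdict(float)        # Q[(s, a)], initialized to 0
    n = defaultdict(int)          # visit counts for the learning rate

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                        # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])     # exploit
            s2, r, done = env.step(a)
            n[(s, a)] += 1
            alpha = 1.0 / n[(s, a)]
            target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```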
Q-Learning Example
• γ = 0.9, α = 0.5, r = 0 for non-terminal states
• (Figure: two neighbouring grid cells s1 and s2; before the update Q(s1, right) = 73, and the actions available in s2 have Q-values 66, 81 and 100.)
• Q(s1, right) ← Q(s1, right) + α ( r + γ max_{a'} Q(s2, a') − Q(s1, right) )
              = 73 + 0.5 ( 0 + 0.9 · max{66, 81, 100} − 73 )
              = 73 + 0.5 · 17
              = 81.5
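A quick check of the arithmetic in this update:

```python
gamma, alpha = 0.9, 0.5
q_s1_right, r = 73.0, 0.0
q_s2 = [66.0, 81.0, 100.0]          # Q-values of the actions available in s2
q_s1_right += alpha * (r + gamma * max(q_s2) - q_s1_right)
print(q_s1_right)                    # 81.5
```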
Exploration vs Exploitation
• If an agent always chooses the action with the highest estimated value, it is exploiting
  – The learned model is not the real model
  – This can lead to suboptimal results
• By taking random actions (pure exploration), an agent may learn the model
  – But what is the use of learning a complete model if parts of it are never used?
• We need a balance between exploitation and exploration
Common Exploration Methods
• ε-greedy:
  – With probability ε, execute a random action
  – Otherwise execute the best action a* = argmax_a Q(s, a)
• Boltzmann exploration:
  Pr(a) = exp(Q(s, a)/T) / Σ_{a'} exp(Q(s, a')/T)
• (Both rules are sketched in code after this slide)
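A minimal sketch of both exploration rules, assuming `Q` is a dictionary keyed by (state, action) pairs.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, temperature):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    prefs = [math.exp(Q[(s, a)] / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```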
Exploration and Q-Learning
• Q-learning converges to the optimal Q-values if:
  – Every state is visited infinitely often (due to exploration)
  – Action selection becomes greedy as time approaches infinity
  – The learning rate α is decreased fast enough, but not too fast (i.e., Σ_t α_t = ∞ and Σ_t α_t² < ∞, as in the TD convergence conditions)
Model-Based Active RL
• Idea: at each step
  – Execute an action
  – Observe the resulting state and reward
  – Update the model
  – Update the policy π
• (A code sketch of this loop follows)
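A rough sketch of this loop (illustrative only): it reuses the hypothetical `env` interface from the earlier sketches, selects actions ε-greedily with respect to the current model-based estimates, and for simplicity re-plans only at episode boundaries rather than after every step.

```python
import random
from collections import defaultdict

def model_based_active_rl(env, states, actions, gamma,
                          epsilon=0.1, n_episodes=500, n_sweeps=50):
    """Active model-based RL sketch: learn a model online and replan greedily."""
    n_sa = defaultdict(int)
    n_sas = defaultdict(int)
    r_sum = defaultdict(float)
    V = defaultdict(float)

    def q(s, a):
        """Model-based Q estimate from the current counts (0 for unseen pairs)."""
        if n_sa[(s, a)] == 0:
            return 0.0
        r_hat = r_sum[(s, a)] / n_sa[(s, a)]
        return r_hat + gamma * sum(
            n_sas[(s, a, s2)] / n_sa[(s, a)] * V[s2] for s2 in states)

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Execute an action (epsilon-greedy w.r.t. the learned model)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: q(s, act))
            # Observe the resulting state and reward, and update the model
            s2, r, done = env.step(a)
            n_sa[(s, a)] += 1
            n_sas[(s, a, s2)] += 1
            r_sum[(s, a)] += r
            s = s2
        # Update the policy: a few value-iteration sweeps on the learned model
        for _ in range(n_sweeps):
            for st in states:
                V[st] = max(q(st, act) for act in actions)
    return V
```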