INF3490 - Biologically inspired computing: Reinforcement Learning. Weria Khaksar, October 10, 2018.
[Figure: movie posters - The Commuter (2018), Ghostbusters (1984).]
"It would be in vain for one Intelligent Being, to set a Rule to the Actions of another, if he had not in his Power, to reward the compliance with, and punish deviation from his Rule, by some Good and Evil, that is not the natural product and consequence of the action itself." (Locke, "Essay", 2.28.6)
"The use of punishments and rewards can at best be a part of the teaching process. Roughly speaking, if the teacher has no other means of communicating to the pupil, the amount of information which can reach him does not exceed the total number of rewards and punishments applied." (Turing (1950), "Computing Machinery and Intelligence")
Applications of RL (from "Deconstructing Reinforcement Learning", ICML 2009).
Examples:
• Barrett WAM robot learning to flip pancakes by reinforcement learning
• Socially Aware Motion Planning with Deep Reinforcement Learning
• Hierarchical Reinforcement Learning for Robot Navigation
• Google DeepMind's Deep Q-learning playing Atari Breakout
Last time: Supervised learning. The untrained classifier outputs "CAT"; the teacher responds "No, it was a dog", and we adjust the classifier parameters.
Supervised learning: Weight updates
A neuron receives inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and computes the activation
$$a = \sum_{i=1}^{n} w_i x_i$$
The output is thresholded at $\theta$:
$$y = \begin{cases} 1 & \text{if } a \geq \theta \\ 0 & \text{if } a < \theta \end{cases}$$
Reinforcement Learning: Infrequent feedback. You make a chess move; 50 chess moves later the only feedback is "You lost", and you must update your chess-playing strategy.
How do we update our system now? We don't know the error!
The reinforcement learning problem: State, Action and Reward
• The agent observes the current state (e.g. the chess board position).
• It selects an action: "Move piece from J1 to H1".
• The environment responds with a reward: "You took an opponent's piece. Reward = 1".
The reinforcement learning problem: State, Action and Reward
Learning is guided by the reward:
• An infrequent numerical feedback indicating how well we are doing.
• Problems:
– The reward does not tell us what we should have done!
– The reward may be delayed: it does not always indicate when we made a mistake.
The reinforcement learning problem: The reward function
• Corresponds to the fitness function of an evolutionary algorithm.
• The reward $r_{t+1}$ is a function of the state $s_t$ and the action $a_t$.
• The reward is a numeric value. It can be negative ("punishment").
• It can be given throughout the learning episode, or only at the end.
• Goal: maximize the total reward.
The reinforcement learning problem: Maximizing total reward
▪ Total reward:
$$R = \sum_{t=0}^{N-1} r_{t+1}$$
▪ Future rewards may be uncertain, and we might care more about rewards that come soon. Therefore, we discount future rewards:
$$R = \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}, \quad 0 \leq \gamma \leq 1$$
or
$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \quad 0 \leq \gamma \leq 1$$
The reinforcement learning problem: Maximizing total reward
▪ Future reward:
$$R = r_1 + r_2 + r_3 + \dots + r_n$$
$$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$$
▪ Discount future rewards (the environment is stochastic):
$$R_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots + \gamma^{n-t} r_n = r_t + \gamma\,(r_{t+1} + \gamma\,(r_{t+2} + \dots)) = r_t + \gamma R_{t+1}$$
▪ A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward.
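As a concrete illustration (a minimal Python sketch, not from the slides), the discounted return of an arbitrary reward sequence can be computed either directly from the definition or through the recursion $R_t = r_t + \gamma R_{t+1}$:

def discounted_return(rewards, gamma=0.9):
    # Direct definition: R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma=0.9):
    # Same quantity via the recursion R_t = r_t + gamma * R_{t+1},
    # accumulated backwards from the last reward.
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

rewards = [0, 0, 1, 0, 5]                    # an arbitrary example episode
print(discounted_return(rewards))            # 0.81*1 + 0.6561*5 = 4.0905
print(discounted_return_recursive(rewards))  # same result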
The reinforcement learning problem: Discounted rewards example

t      0.99^t     0.95^t     0.50^t     0.05^t
1      0.990000   0.950000   0.500000   0.050000
2      0.980100   0.902500   0.250000   0.002500
4      0.960596   0.814506   0.062500   0.000006
8      0.922745   0.663420   0.003906   0.000000
16     0.851458   0.440127   0.000015   0.000000
32     0.724980   0.193711   0.000000   0.000000
64     0.525596   0.037524   0.000000   0.000000
The reinforcement learning problem: Discounted rewards example
[Figure: plot of γ^t against time (0 to 60) for γ = 0.99, 0.95, 0.50 and 0.05.]
The reinforcement learning problem: Action Selection
• At each learning stage, the RL algorithm looks at the possible actions and calculates the expected average reward, $Q_{s,t}(a)$.
• Based on $Q_{s,t}(a)$, an action is selected using one of:
➢ Greedy strategy: pure exploitation.
➢ ε-greedy strategy: exploitation with a little exploration.
➢ Soft-max strategy:
$$P(Q_{s,t}(a)) = \frac{e^{Q_{s,t}(a)/\tau}}{\sum_{b} e^{Q_{s,t}(b)/\tau}}$$
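A minimal sketch of the three selection strategies (my own illustration, assuming the Q-values for the current state are stored in a NumPy array):

import numpy as np

def greedy(q_values):
    # Pure exploitation: always pick the action with the highest Q-value.
    return int(np.argmax(q_values))

def epsilon_greedy(q_values, epsilon=0.1):
    # Exploit most of the time, explore with probability epsilon.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def softmax_action(q_values, tau=1.0):
    # Soft-max: sample an action with probability proportional to exp(Q/tau).
    prefs = np.exp((q_values - np.max(q_values)) / tau)  # shift by max for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = np.array([0.2, 0.5, 0.1])      # example Q-values for three actions in one state
a = epsilon_greedy(q, epsilon=0.1)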
The reinforcement learning problem: Policy (π) and Value (V)
▪ The set of actions we take defines our policy (π).
▪ The expected reward we get in return defines our value (V).
The reinforcement learning problem: Markov Decision Process
• If we only need to know the current state, the problem has the Markov property.
• Without the Markov property:
$$\Pr(r_t = r',\, s_{t+1} = s' \mid s_t, a_t, r_{t-1}, \dots, r_1, s_1, a_1, s_0, a_0)$$
• With the Markov property:
$$\Pr(r_t = r',\, s_{t+1} = s' \mid s_t, a_t)$$
The reinforcement learning problem: Markov Decision Process
[Figure: a simple example of a Markov Decision Process.]
The reinforcement learning problem: Value
• The expected future reward is known as the value.
• Two ways to compute the value:
– The value of a state, $V(s)$, averaged over all possible actions in that state (state-value function):
$$V(s) = E[\,R_t \mid s_t = s\,] = E\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right]$$
– The value of a state/action pair, $Q(s,a)$ (action-value function):
$$Q(s,a) = E[\,R_t \mid s_t = s, a_t = a\,] = E\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$
• $Q$ and $V$ are initially unknown, and are learned iteratively as we gain experience.
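To illustrate "learned iteratively as we gain experience", here is a rough Monte Carlo sketch that estimates V(s) as the average discounted return observed from each state. The environment interface (env.reset() and env.step() returning next state, reward and a done flag) and the policy function are hypothetical, not from the slides:

from collections import defaultdict

def mc_state_values(env, policy, episodes=1000, gamma=0.9):
    # Estimate V(s) as the average of the discounted returns observed from s.
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(episodes):
        # Roll out one episode under the given policy.
        trajectory, s, done = [], env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            trajectory.append((s, r))
            s = s_next
        # Walk backwards, accumulating R_t = r_{t+1} + gamma * R_{t+1}.
        R = 0.0
        for s, r in reversed(trajectory):
            R = r + gamma * R
            returns_sum[s] += R
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}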
The reinforcement learning problem: The Q-Learning Algorithm
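The algorithm listing on the slide is an image; as a stand-in, here is a minimal tabular sketch of the standard Q-learning update, which moves Q(s,a) a small step (the learning rate) towards r + γ max_{a'} Q(s',a'). States and actions are assumed to be integer indices, the environment interface is the same hypothetical reset()/step() one as above, and epsilon_greedy is the helper defined earlier:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               lr=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning (off-policy): bootstrap with the best action in the next state.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)        # behave epsilon-greedily
            s_next, r, done = env.step(a)
            # Off-policy target: assume the greedy action will be taken in s_next.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += lr * (target - Q[s, a])
            s = s_next
    return Q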
The reinforcement learning problem: The SARSA Algorithm
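Again the listing itself is a figure; for comparison, a sketch of the standard SARSA update, reusing the hypothetical environment and the epsilon_greedy helper from above. The only change from Q-learning is that the target uses the action the policy actually chooses in the next state (on-policy), rather than the greedy maximum:

import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          lr=0.1, gamma=0.9, epsilon=0.1):
    # Tabular SARSA (on-policy): bootstrap with the action actually taken next.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q[s], epsilon)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q[s_next], epsilon)   # chosen by the same policy
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += lr * (target - Q[s, a])
            s, a = s_next, a_next
    return Q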
Q-learning example (credits: Arjun Chandra).
[Figures: step-by-step Q-value updates on a grid world with a "home" state; a reward of −1 is shown.]
Action selection
On-policy vs off-policy learning
• Reward structure: each move: −1; moving into the cliff: −100.
• Policy: 90% chance of choosing the best action (exploit), 10% chance of choosing a random action (explore).
[Figure: cliff-walking grid world with Start, The Cliff, and Goal.]
On-policy vs off-policy learning: Q-learning
• Q-learning always assumes the optimal action will be taken, so while learning it does not take the accidental falls into the cliff into account. Therefore, it does not learn that the cliff is dangerous.
• The resulting path is efficient, but risky.
[Figure: cliff-walking grid world with Start, The Cliff, and Goal.]
On-policy vs off-policy learning: SARSA
• During learning, we regularly fall into the cliff (due to the 10% chance of exploring in our policy), and SARSA's on-policy updates take these falls into account.
• That information propagates to all states, generating a safer plan.
[Figure: cliff-walking grid world with Start, The Cliff, and Goal.]
Which plan is better?
• SARSA (on-policy): [Figure: the safer path, keeping away from the cliff.]
• Q-learning (off-policy): [Figure: the shorter path along the cliff edge.]
Using evolution and neural networks in reinforcement learning. Example: MarI/O - Machine Learning for Video Games.