INF3490 - Biologically inspired computing: Reinforcement Learning

  1. INF3490 - Biologically inspired computing Reinforcement Learning Weria Khaksar October 10, 2018

  2. The Commuter (2018), Ghostbusters (1984)

  3. “It would be in vain for one Intelligent Being, to set a Rule to the Actions of another, if he had not in his Power, to reward the compliance with, and punish deviation from his Rule, by some Good and Evil, that is not the natural product and consequence of the action itself.” (Locke, “Essay”, 2.28.6)
     “The use of punishments and rewards can at best be a part of the teaching process. Roughly speaking, if the teacher has no other means of communicating to the pupil, the amount of information which can reach him does not exceed the total number of rewards and punishments applied.” (Turing (1950), “Computing Machinery and Intelligence”)

  4. Applications of RL. From: “Deconstructing Reinforcement Learning”, ICML 2009

  5. Examples:
     • Barrett WAM robot learning to flip pancakes by reinforcement learning
     • Socially Aware Motion Planning with Deep Reinforcement Learning
     • Hierarchical Reinforcement Learning for Robot Navigation
     • Google DeepMind's Deep Q-learning playing Atari Breakout

  6. Last time: Supervised learning. The untrained classifier answers “CAT”. Feedback: “No, it was a dog.” Adjust classifier parameters.

  7. Supervised learning: Weight updates
     Inputs $x_1, x_2, \ldots, x_n$ with weights $w_1, w_2, \ldots, w_n$; activation $a = \sum_{i=1}^{n} w_i x_i$; threshold $\theta$; output $y = 1$ if $a \ge \theta$ and $y = 0$ if $a < \theta$.
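The threshold unit on this slide can be sketched in a few lines of Python. This is only an illustration of the activation and output rule; the function name `threshold_unit` and the AND weights are assumptions chosen for the example, not taken from the lecture.

```python
def threshold_unit(inputs, weights, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold theta, else 0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    a = sum(w * x for w, x in zip(weights, inputs))  # a = sum_i w_i * x_i
    return 1 if a >= theta else 0


# Hypothetical weights and threshold that make the unit behave like a logical AND.
and_weights = [1.0, 1.0]
and_theta = 1.5
outputs = [threshold_unit([x1, x2], and_weights, and_theta)
           for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)
```

In supervised learning the known target for each input tells us in which direction to adjust the weights; the reinforcement learning setting that follows lacks exactly that error signal.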

  8. Reinforcement Learning: Infrequent feedback. 50 chess moves later: “You lost.” Update chess-playing strategy.

  9. How do we update our system now? We don’t know the error!

  13. The reinforcement learning problem: State, Action and Reward

  16. The reinforcement learning problem: State, Action and Reward. Action: “Move piece from J1 to H1”

  17. The reinforcement learning problem: State, Action and Reward. Feedback: “You took an opponent’s piece.” Reward = 1

  20. The reinforcement learning problem: State, Action and Reward
      Learning is guided by the reward
      • An infrequent numerical feedback indicating how well we are doing.
      • Problems:
        – The reward does not tell us what we should have done!
        – The reward may be delayed: it does not always indicate when we made a mistake.

  21. The reinforcement learning problem: The reward function
      • Corresponds to the fitness function of an evolutionary algorithm.
      • $r_{t+1}$ is a function of $(s_t, a_t)$.
      • The reward is a numeric value. It can be negative (“punishment”).
      • It can be given throughout the learning episode, or only at the end.
      • Goal: Maximize total reward.

  22. The reinforcement learning problem: Maximizing total reward
      ▪ Total reward: $R = \sum_{t=0}^{N-1} r_{t+1}$
      Future rewards may be uncertain and we might care more about rewards that come soon. Therefore, we discount future rewards:
      $R = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}, \quad 0 \le \gamma \le 1$
      or
      $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \quad 0 \le \gamma \le 1$

  23. The reinforcement learning problem: Maximizing total reward
      ▪ Future reward:
        $R = r_1 + r_2 + r_3 + \cdots + r_n$
        $R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$
      ▪ Discount future rewards (environment is stochastic):
        $R_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots + \gamma^{n-t} r_n = r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \cdots)) = r_t + \gamma R_{t+1}$
      ▪ A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
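The recursion $R_t = r_t + \gamma R_{t+1}$ can be checked against the direct sum with a short sketch. The reward sequence and the function names below are made up for illustration, and the return after the final step is assumed to be zero.

```python
def discounted_return(rewards, gamma):
    """Direct sum: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def returns_backward(rewards, gamma):
    """R_t for every t, using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    next_return = 0.0  # assumption: nothing is earned after the last step
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


rewards = [0.0, 0.0, 1.0, -1.0, 10.0]  # hypothetical episode
gamma = 0.5
all_returns = returns_backward(rewards, gamma)
print(all_returns)
print(discounted_return(rewards, gamma))
```

Working backward from the end of the episode needs one pass over the rewards, whereas evaluating the direct sum separately for every $t$ would repeat most of the work.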

  24. The reinforcement learning problem: Discounted rewards example
      | t  | 0.99^t   | 0.95^t   | 0.50^t   | 0.05^t   |
      |----|----------|----------|----------|----------|
      | 1  | 0.990000 | 0.950000 | 0.500000 | 0.050000 |
      | 2  | 0.980100 | 0.902500 | 0.250000 | 0.002500 |
      | 4  | 0.960596 | 0.814506 | 0.062500 | 0.000006 |
      | 8  | 0.922745 | 0.663420 | 0.003906 | 0.000000 |
      | 16 | 0.851458 | 0.440127 | 0.000015 | 0.000000 |
      | 32 | 0.724980 | 0.193711 | 0.000000 | 0.000000 |
      | 64 | 0.525596 | 0.037524 | 0.000000 | 0.000000 |
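The discount factors in this example are plain powers $\gamma^t$, so they can be regenerated with a few lines. The helper name `discount_table` is an assumption for this sketch.

```python
gammas = [0.99, 0.95, 0.50, 0.05]
steps = [1, 2, 4, 8, 16, 32, 64]


def discount_table(gammas, steps):
    """Map each time step t to the list of gamma**t values, one per gamma."""
    return {t: [g ** t for g in gammas] for t in steps}


table = discount_table(gammas, steps)
print("t   " + "  ".join(f"{g:.2f}^t  " for g in gammas))
for t in steps:
    print(f"{t:<3} " + "  ".join(f"{value:.6f}" for value in table[t]))
```

With $\gamma = 0.99$ a reward 64 steps away still counts for about half its value, while with $\gamma = 0.05$ anything beyond a few steps is effectively ignored.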

  25. The reinforcement learning problem: Discounted rewards example
      [Plot: γ^time against time (0 to 60) for γ = 0.99, 0.95, 0.50 and 0.05]

  26. The reinforcement learning problem: Action selection
      • At each learning stage, the RL algorithm looks at the possible actions and calculates the expected average reward $Q_{s,t}(a)$.
      • Based on $Q_{s,t}(a)$, an action is selected using one of:
        ➢ Greedy strategy: pure exploitation
        ➢ $\epsilon$-greedy strategy: exploitation with a little exploration
        ➢ Soft-max strategy: $P(Q_{s,t}(a)) = \frac{\exp(Q_{s,t}(a)/\tau)}{\sum_{b} \exp(Q_{s,t}(b)/\tau)}$
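The three selection strategies can be sketched as follows, assuming the estimates for one state are held in a plain list indexed by action. The function names are illustrative, and subtracting the maximum before exponentiating is a numerical-stability choice added here, not something stated on the slide.

```python
import math
import random


def greedy(q_values):
    """Pure exploitation: index of the action with the highest estimate."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def softmax_probabilities(q_values, tau):
    """P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    highest = max(q_values)  # shifting by the max avoids overflow; the ratio is unchanged
    weights = [math.exp((q - highest) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(q_values, tau, rng=random):
    """Sample an action index according to the soft-max probabilities."""
    probabilities = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probabilities)[0]


q = [1.0, 3.0, 2.0]  # hypothetical estimates for three actions
print(greedy(q), softmax_probabilities(q, tau=1.0))
```

A large temperature $\tau$ pushes the soft-max probabilities toward uniform (more exploration); a small $\tau$ makes the choice close to greedy.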

  27. The reinforcement learning problem: Policy ($\pi$) and Value ($V$)
      ▪ The set of actions we take defines our policy ($\pi$).
      ▪ The expected reward we get in return defines our value ($V$).

  28. The reinforcement learning problem: Markov Decision Process
      • If we only need to know the current state, the problem has the Markov property.
      • No Markov property: $\Pr(r_t = r', s_{t+1} = s' \mid s_t, a_t, r_{t-1}, \ldots, r_1, s_1, a_1, s_0, a_0)$
      • Markov property: $\Pr(r_t = r', s_{t+1} = s' \mid s_t, a_t)$

  29. The reinforcement learning problem: Markov Decision Process. A simple example of a Markov Decision Process
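One way to picture a Markov Decision Process in code is a table keyed only by the current state and action. The two-state process below is entirely hypothetical (it is not the example from the slide's figure); the state names, action names, probabilities and rewards are all made up for illustration.

```python
import random

# Hypothetical MDP: transitions[(state, action)] lists
# (probability, next_state, reward) triples.
transitions = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "warm", -10.0)],
}


def step(state, action, rng=random):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = transitions[(state, action)]
    probabilities = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=probabilities)[0]
    return next_state, reward


print(step("cool", "fast"))
```

Because `step` never looks at earlier states, actions or rewards, the process has the Markov property by construction.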

  30. The reinforcement learning problem: Value
      • The expected future reward is known as the value.
      • Two ways to compute the value:
        – The value of a state, $V(s)$, averaged over all possible actions in that state (state-value function):
          $V(s) = E(r_t \mid s_t = s) = E\left\{ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1} \mid s_t = s \right\}$
        – The value of a state/action pair, $Q(s, a)$ (action-value function):
          $Q(s, a) = E(r_t \mid s_t = s, a_t = a) = E\left\{ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1} \mid s_t = s, a_t = a \right\}$
      • $Q$ and $V$ are initially unknown, and are learned iteratively as we gain experience.

  31. The reinforcement learning problem: The Q-Learning Algorithm

  33. The reinforcement learning problem: The SARSA Algorithm
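As a minimal sketch, the standard tabular update rules for the two algorithms can be written side by side. The dictionary-based table, the learning rate `alpha`, and the state and action names are assumptions made for this example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])


actions = ["left", "right"]  # hypothetical action set
Q_off = defaultdict(float)   # unseen state/action pairs start at 0
Q_on = defaultdict(float)
for Q in (Q_off, Q_on):
    Q[("B", "left")] = -4.0
    Q[("B", "right")] = 2.0

# Same transition A --right--> B with reward -1; the agent then explores "left" in B.
q_learning_update(Q_off, "A", "right", -1.0, "B", actions, alpha=0.5, gamma=1.0)
sarsa_update(Q_on, "A", "right", -1.0, "B", "left", alpha=0.5, gamma=1.0)
print(Q_off[("A", "right")], Q_on[("A", "right")])
```

The only difference is the bootstrap term: Q-learning uses the maximum over next actions regardless of what the agent does next, while SARSA uses the value of the action it actually takes, including exploratory ones.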

  34. Q-learning example
      • Credits: Arjun Chandra

  60. Action selection

  61. On-policy vs off-policy learning
      • Reward structure: Each move: -1. Move to cliff: -100.
      • Policy: 90% chance of choosing best action (exploit). 10% chance of choosing random action (explore).
      [Figure: grid world with Start, The Cliff and Goal]
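The reward structure and the 90/10 policy can be sketched as a small grid world. The 4 x 12 layout, the rule that falling into the cliff sends the agent back to the start, and all names below are assumptions for illustration; the slide only gives the rewards and the exploration rate.

```python
import random

ROWS, COLS = 4, 12                      # assumed grid size, not given on the slide
START, GOAL = (3, 0), (3, 11)           # bottom-left and bottom-right corners
CLIFF = {(3, c) for c in range(1, 11)}  # bottom row between start and goal
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for one move."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), ROWS - 1)  # moves off the grid stop at the edge
    col = min(max(state[1] + d_col, 0), COLS - 1)
    if (row, col) in CLIFF:
        return START, -100, False  # assumption: a fall sends the agent back to the start
    return (row, col), -1, (row, col) == GOAL


def choose_action(q_values, explore_probability=0.1, rng=random):
    """90% best action (exploit), 10% random action (explore)."""
    if rng.random() < explore_probability:
        return rng.choice(list(MOVES))
    return max(MOVES, key=lambda a: q_values.get(a, 0.0))


print(step(START, "right"))  # stepping off the start square into the cliff
```

Here `q_values` is a hypothetical per-state dictionary from action name to estimated value; actions without an entry are treated as 0.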

  62. On-policy vs off-policy learning: Q-learning
      • The update always assumes the optimal action is taken next, so the learned values do not account for exploratory moves that end in the cliff. Therefore, it does not learn that walking next to the cliff is dangerous.
      • The resulting path is efficient, but risky.
      [Figure: grid world with Start, The Cliff and Goal]

  63. On-policy vs off-policy learning: SARSA
      • During learning, we regularly end up over the cliff edge (due to the 10% chance of exploring in our policy), and the update uses the action that was actually taken.
      • That information propagates to all states, generating a safer plan.
      [Figure: grid world with Start, The Cliff and Goal]

  64. Which plan is better?
      • SARSA (on-policy): [Figure: path from Start to Goal past The Cliff]
      • Q-learning (off-policy): [Figure: path from Start to Goal past The Cliff]

  65. Using evolution and neural networks in reinforcement learning. Example: MarI/O - Machine Learning for Video Games
