Lecture 14: Reinforcement Learning
Fei-Fei Li & Justin Johnson & Serena Yeung
May 23, 2017

Administrative — Grades: midterm grades


Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update. Q_i will converge to Q* as i → ∞.
What's the problem with this? Not scalable. We must compute Q(s, a) for every state-action pair. If the state is e.g. the current game state in pixels, it is computationally infeasible to compute this for the entire state space!
Solution: use a function approximator to estimate Q(s, a) — e.g. a neural network!

Solving for the optimal policy: Q-learning
Q-learning: use a function approximator to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a), where θ are the function parameters (weights).
If the function approximator is a deep neural network => deep Q-learning!

Solving for the optimal policy: Q-learning
Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass — loss function:
  L_i(θ_i) = E_{s,a}[(y_i − Q(s, a; θ_i))²], where y_i = E[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a]

Backward pass — gradient update (with respect to the Q-function parameters θ):
  ∇_{θ_i} L_i(θ_i) = E_{s,a}[(y_i − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)]

Iteratively try to make the Q-value close to the target value y_i it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*).
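The forward/backward passes above can be made concrete with a tiny sketch. This is an illustration, not the lecture's code: it performs one Q-learning gradient step with a linear function approximator, where the feature vector `phi` and all constants are invented for the example (for linear Q, ∇_θ Q(s, a; θ) is just the feature vector, which keeps the update one line).

```python
# One Q-learning update with a linear function approximator:
# Q(s, a; theta) = theta[a] . phi(s). All names (phi, theta) and
# constants here are illustrative, not from the lecture.

GAMMA = 0.9     # discount factor
LR = 0.1        # learning rate
N_ACTIONS = 2
N_FEATURES = 3

def q_value(theta, phi, a):
    """Q(s, a; theta) for the feature vector phi of state s."""
    return sum(w * x for w, x in zip(theta[a], phi))

def q_update(theta, phi, a, r, phi_next, done):
    """Gradient step on L = (y - Q(s, a; theta))^2 for one transition."""
    # TD target: y = r + gamma * max_a' Q(s', a')  (y = r at terminal states)
    y = r if done else r + GAMMA * max(q_value(theta, phi_next, b)
                                       for b in range(N_ACTIONS))
    td_error = y - q_value(theta, phi, a)
    # For linear Q, grad_theta Q(s, a) = phi, so the update is simple:
    theta[a] = [w + LR * td_error * x for w, x in zip(theta[a], phi)]
    return td_error

theta = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]
td = q_update(theta, phi=[1.0, 0.0, 1.0], a=0, r=1.0,
              phi_next=[0.0, 1.0, 0.0], done=False)
```

Starting from zero weights, the target is y = 1.0 (the next state's Q-values are still zero), so the TD error is 1.0 and the weights of the two active features each move by LR × 1.0.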

Case Study: Playing Atari Games [Mnih et al. NIPS Workshop 2013; Nature 2015]
Objective: complete the game with the highest score
State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Q-network Architecture [Mnih et al. NIPS Workshop 2013; Nature 2015]
Input: current state s_t — an 84x84x4 stack of the last 4 frames (after RGB → grayscale conversion, downsampling, and cropping)
- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)
Familiar conv layers and an FC layer — a neural network with weights θ.
The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
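The spatial dimensions of this network can be traced by hand. As a sketch, assuming valid convolutions (no padding), the standard output-size formula gives the shapes at each layer:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

# Trace the 84x84x4 input through the two conv layers
s1 = conv_out(84, 8, 4)   # 16 filters, 8x8, stride 4 -> 20x20x16
s2 = conv_out(s1, 4, 2)   # 32 filters, 4x4, stride 2 -> 9x9x32
flat = s2 * s2 * 32       # flattened input size to the FC-256 layer
```

So the FC-256 layer sees a 9 × 9 × 32 = 2592-dimensional vector, and FC-4 maps the 256 hidden units to one Q-value per action.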

Training the Q-network: loss function (from before) [Mnih et al. NIPS Workshop 2013; Nature 2015]
Remember: we want to find a Q-function that satisfies the Bellman equation. As before, the forward pass computes the loss L_i(θ_i) = E[(y_i − Q(s, a; θ_i))²] with target y_i = E[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a], and the backward pass takes a gradient step on θ, iteratively pushing the Q-value toward the target value y_i it should have if the Q-function corresponds to the optimal Q* (and optimal policy π*).

Training the Q-network: Experience Replay [Mnih et al. NIPS Workshop 2013; Nature 2015]
Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
Each transition can also contribute to multiple weight updates => greater data efficiency.
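A replay memory like the one described is a few lines of code. This is a minimal sketch, not the authors' implementation; the capacity value is illustrative (the Nature paper used a memory of one million transitions):

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory: store transitions, sample random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random minibatch breaks the correlation between consecutive samples
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=4)
for t in range(6):                  # store 6 transitions, capacity keeps last 4
    memory.store(t, 0, 1.0, t + 1, False)
batch = memory.sample(2)
```

Because sampling is uniform over the buffer, each stored transition can be drawn many times, which is exactly the "multiple weight updates per transition" data-efficiency point above.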

Putting it together: Deep Q-Learning with Experience Replay [Mnih et al. NIPS Workshop 2013; Nature 2015]
- Initialize the replay memory and the Q-network
- Play M episodes (full games)
- Initialize the state (starting game screen pixels) at the beginning of each episode
- For each timestep t of the game:
  - With small probability, select a random action (explore); otherwise select the greedy action from the current policy
  - Take the action a_t, and observe the reward r_t and next state s_{t+1}
  - Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory
  - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step
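The whole loop can be sketched end-to-end on a toy problem. This is a dependency-free illustration, not the authors' code: the environment (a 5-state chain with a reward at the right end), the episode cap, and all hyperparameters are invented, and a tabular dict stands in for the Q-network.

```python
import random
from collections import defaultdict, deque

N_STATES, N_ACTIONS = 5, 2            # actions: 0 = left, 1 = right
GAMMA, LR, EPSILON = 0.9, 0.5, 0.3
M_EPISODES, BATCH, MAX_STEPS = 300, 8, 100

def step(s, a):
    """Toy dynamics: reward 1 for reaching the rightmost state."""
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

Q = defaultdict(float)                # tabular stand-in for the Q-network
replay = deque(maxlen=1000)           # replay memory
rng = random.Random(0)

for _ in range(M_EPISODES):           # play M episodes
    s, done, t = 0, False, 0          # initialize state each episode
    while not done and t < MAX_STEPS:
        t += 1
        # epsilon-greedy: explore with small probability, else act greedily
        if rng.random() < EPSILON:
            a = rng.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda b: Q[(s, b)])
        s_next, r, done = step(s, a)
        replay.append((s, a, r, s_next, done))     # store transition
        # experience replay: update on a random minibatch
        for ss, aa, rr, sn, dd in rng.sample(list(replay),
                                             min(BATCH, len(replay))):
            y = rr if dd else rr + GAMMA * max(Q[(sn, b)]
                                               for b in range(N_ACTIONS))
            Q[(ss, aa)] += LR * (y - Q[(ss, aa)])
        s = s_next

greedy_at_start = max(range(N_ACTIONS), key=lambda b: Q[(0, b)])
```

After training, the greedy action in the start state should be "right", and Q-values near the goal should prefer moving toward it, mirroring the structure of the full DQN algorithm at toy scale.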

Video: https://www.youtube.com/watch?v=V1eYniJ0Rnk (by Károly Zsolnai-Fehér; reproduced with permission)

Policy Gradients
What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.
But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. finding the best policy from a collection of policies?

Policy Gradients
Formally, let's define a class of parametrized policies: Π = {π_θ, θ ∈ R^m}
For each policy, define its value: J(θ) = E[Σ_{t≥0} γ^t r_t | π_θ]
We want to find the optimal policy θ* = arg max_θ J(θ). How can we do this? Gradient ascent on the policy parameters!

REINFORCE algorithm
Mathematically, we can write J(θ) = E_{τ∼p(τ;θ)}[r(τ)], where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, ...).

REINFORCE algorithm
Expected reward: J(θ) = E_{τ∼p(τ;θ)}[r(τ)] = ∫ r(τ) p(τ; θ) dτ
Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ — intractable! The gradient of an expectation is problematic when p depends on θ.
However, we can use a nice trick: ∇_θ p(τ; θ) = p(τ; θ) (∇_θ p(τ; θ) / p(τ; θ)) = p(τ; θ) ∇_θ log p(τ; θ)
If we inject this back: ∇_θ J(θ) = ∫ (r(τ) ∇_θ log p(τ; θ)) p(τ; θ) dτ = E_{τ∼p(τ;θ)}[r(τ) ∇_θ log p(τ; θ)] — can estimate with Monte Carlo sampling.
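The log-derivative trick can be checked numerically. As a sketch on an invented one-step "policy" — take action 1 with probability sigmoid(θ), with reward 1 for action 1 and 0 for action 0 — we have J(θ) = σ(θ), so the true gradient is σ(θ)(1 − σ(θ)), and the Monte Carlo estimator E[r(τ) ∇_θ log p(τ; θ)] should match it:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.3
p = sigmoid(theta)                     # prob. of action 1
analytic_grad = p * (1.0 - p)          # dJ/dtheta for this toy setup

rng = random.Random(0)
n = 200_000
total = 0.0
for _ in range(n):
    a = 1 if rng.random() < p else 0
    r = 1.0 if a == 1 else 0.0
    # grad of log p(a; theta): (1 - p) for a = 1, -p for a = 0
    grad_log_p = (1.0 - p) if a == 1 else -p
    total += r * grad_log_p            # r(tau) * grad log p(tau; theta)
mc_grad = total / n
```

With 200k samples the Monte Carlo estimate lands within about 1% of the analytic gradient, which is the whole content of the trick: an expectation we can sample replaces a gradient we cannot compute.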

REINFORCE algorithm
Can we compute those quantities without knowing the transition probabilities?
We have: p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)
Thus: log p(τ; θ) = Σ_{t≥0} [log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t)]
And when differentiating: ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t) — doesn't depend on the transition probabilities!
Therefore, when sampling a trajectory τ, we can estimate ∇_θ J(θ) with Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t).

Intuition
Gradient estimator: Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
Interpretation:
- If r(τ) is high, push up the probabilities of the actions seen
- If r(τ) is low, push down the probabilities of the actions seen
Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!
However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?

Variance reduction
Gradient estimator: Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
First idea: push up the probabilities of an action seen only by the cumulative future reward from that state: Σ_{t≥0} (Σ_{t'≥t} r_{t'}) ∇_θ log π_θ(a_t | s_t)
Second idea: use a discount factor γ to ignore delayed effects: Σ_{t≥0} (Σ_{t'≥t} γ^{t'−t} r_{t'}) ∇_θ log π_θ(a_t | s_t)
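Both ideas weight each action by the discounted reward-to-go from its timestep. A small sketch of that computation (the backward recursion is the standard way to avoid re-summing the tail at every step):

```python
def rewards_to_go(rewards, gamma=0.99):
    """For each t, return sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    out = [0.0] * len(rewards)
    running = 0.0
    # iterate backwards so each step reuses the later steps' total
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rtg = rewards_to_go([0.0, 0.0, 1.0], gamma=0.5)
```

With γ = 1 this reduces to the plain cumulative-future-reward weighting of the first idea; with γ < 1, rewards far in the future contribute less to earlier actions.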

Variance reduction: Baseline
Problem: the raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions.
What is important then? Whether a reward is better or worse than what you expect to get.
Idea: introduce a baseline function dependent on the state. Concretely, the estimator is now: Σ_{t≥0} (Σ_{t'≥t} γ^{t'−t} r_{t'} − b(s_t)) ∇_θ log π_θ(a_t | s_t)
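The effect of a baseline can be checked numerically. As a sketch in a one-step setting invented for this example — action 1 taken with probability sigmoid(θ), all rewards positive (2 for action 1, 1 for action 0), baseline b = 1.5 (the expected reward) — subtracting the baseline leaves the estimator's mean unchanged while shrinking its variance:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.0
p = sigmoid(theta)                      # = 0.5
rng = random.Random(1)
n = 100_000
b = 1.5                                 # baseline = expected reward here
plain, with_b = [], []
for _ in range(n):
    a = 1 if rng.random() < p else 0
    r = 2.0 if a == 1 else 1.0
    glp = (1.0 - p) if a == 1 else -p   # grad log pi(a; theta)
    plain.append(r * glp)               # raw estimator sample
    with_b.append((r - b) * glp)        # baseline-subtracted sample

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

var_plain, var_base = var(plain), var(with_b)
```

The baseline term has zero mean because E[∇_θ log π_θ(a)] = 0, so the estimator stays unbiased (true gradient 0.25 here); in this toy case b = 1.5 happens to be variance-minimizing and the variance collapses entirely.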

How to choose the baseline?
A simple baseline: a constant moving average of rewards experienced so far from all trajectories.
The variance reduction techniques seen so far are typically used in "Vanilla REINFORCE".

How to choose the baseline?
A better baseline: we want to push up the probability of an action from a state if this action was better than the expected value of what we should get from that state.
Q: What does this remind you of? A: the Q-function and value function!
Intuitively, we are happy with an action a_t in a state s_t if Q^π(s_t, a_t) − V^π(s_t) is large. On the contrary, we are unhappy with an action if it's small.
Using this, we get the estimator: Σ_{t≥0} (Q^π(s_t, a_t) − V^π(s_t)) ∇_θ log π_θ(a_t | s_t)

Actor-Critic Algorithm
Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).
- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
- This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks, e.g. experience replay
- Remark: we can define the advantage function A^π(s, a) = Q^π(s, a) − V^π(s), measuring how much an action was better than expected

Actor-Critic Algorithm
Initialize policy parameters θ, critic parameters φ
For iteration = 1, 2, ... do
    Sample m trajectories under the current policy
    For i = 1, ..., m do
        For t = 1, ..., T do
            Accumulate the advantage-weighted policy gradient for θ and the critic's value-error gradient for φ
        End for
    End for
    Update θ (gradient ascent) and φ (gradient descent)
End for
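A runnable sketch can make the actor/critic split concrete. This is not the lecture's code: it is a one-step actor-critic on an invented 5-state chain, where the critic learns V(s) and the TD error r + γV(s') − V(s) serves as the advantage estimate — a common variant of the scheme above.

```python
import math, random
from collections import defaultdict

N_STATES = 5                        # reward 1 on reaching state 4
GAMMA, LR_ACTOR, LR_CRITIC = 0.9, 0.2, 0.2
rng = random.Random(0)

pref = defaultdict(float)           # actor: action preferences h[(s, a)]
V = defaultdict(float)              # critic: state values V[s]

def policy(s):
    """Softmax over the two actions (0 = left, 1 = right)."""
    e = [math.exp(pref[(s, a)]) for a in (0, 1)]
    z = e[0] + e[1]
    return [e[0] / z, e[1] / z]

for _ in range(500):                # iterations (episodes)
    s, done, t = 0, False, 0
    while not done and t < 200:
        t += 1
        probs = policy(s)
        a = 1 if rng.random() < probs[1] else 0
        s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
        done = s_next == N_STATES - 1
        r = 1.0 if done else 0.0
        # critic: TD error doubles as the advantage estimate
        td = r + (0.0 if done else GAMMA * V[s_next]) - V[s]
        V[s] += LR_CRITIC * td
        # actor: move log pi(a|s) along the advantage (softmax gradient)
        for b in (0, 1):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            pref[(s, b)] += LR_ACTOR * td * grad_log
        s = s_next
```

After training, the policy should prefer "right" (the rewarded direction), and the critic's values should increase toward the goal — the critic only ever evaluates the states the actor actually visits, as noted above.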

REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]
Objective: image classification. Take a sequence of "glimpses" selectively focusing on regions of the image to predict the class.
- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image
State: glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

[Figure: the RNN unrolled over five glimpses of the input image — at each step the network takes the glimpse at the current location (x_t, y_t), updates its hidden state, and emits the next glimpse location; after the final glimpse, a softmax predicts the class label (y = 2 in the example). Mnih et al. 2014]

REINFORCE in action: Recurrent Attention Model (RAM) [Mnih et al. 2014]
Has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering!
