  1. Deep RL (Robert Platt, Northeastern University)

  2. Q-learning: [diagram] the Q-function maps (state, action) to a value; the agent acts on the World by taking an argmax of Q over actions; the World returns the next state; the tabular update rule adjusts Q(s, a).

  3. Q-learning: [same diagram and update rule repeated]

  4. Deep Q-learning (DQN): [diagram] the Q-function is now a neural network: the state goes in, and the values of the different possible discrete actions come out; the agent still picks the argmax action and sends it to the World.

  5. Deep Q-learning (DQN): [same diagram] But why would we want to do this?

  6. Where does “state” come from? [agent/environment diagram: the agent takes actions a and perceives states and rewards s, r] Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as state with a tabular Q-function; it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

  7. Where does “state” come from? [same agent/environment diagram] Is it possible to do RL WITHOUT hand-coding states? Earlier, we dodged this question: “it’s part of the MDP problem statement.” But that’s a cop-out. How do we get state? Typically we can’t use “raw” sensor data as state with a tabular Q-function; it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).

  8. DQN

  9. DQN: Instead of a state, we have an image; in practice, it could be a history of the k most recent images stacked as a single k-channel image. Hopefully this new image representation is Markov… in some domains, it might not be!
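Below is a minimal sketch of the frame-stacking idea from slide 9. The helper name make_frame_stacker, the padding-at-episode-start behavior, and k = 4 are illustrative assumptions, not details from the slides.

    from collections import deque
    import numpy as np

    def make_frame_stacker(k=4):
        """Return a function that maps each new frame to a k-channel observation."""
        frames = deque(maxlen=k)

        def stack(frame):
            # At the start of an episode the deque is short, so pad by repeating
            # the first frame until we have k of them.
            while len(frames) < k - 1:
                frames.append(frame)
            frames.append(frame)
            # Shape (k, H, W): the k most recent frames as one k-channel image.
            return np.stack(frames, axis=0)

        return stack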

  10. DQN: [network diagram] a stack of images is fed to the Q-function network: Conv 1 → Conv 2 → FC 1 → Output.

  11. DQN: [same network diagram] Conv 1 → Conv 2 → FC 1 → Output.

  12. DQN: [same network diagram] The number of output nodes equals the number of actions.
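A sketch of a network with the layer structure named on slides 10-12 (Conv 1, Conv 2, FC 1, Output), written in PyTorch. The channel counts, kernel sizes, strides, and the 84x84 input size are assumptions for illustration; the slides only specify the overall structure and that the number of outputs equals the number of actions.

    import torch
    import torch.nn as nn

    class DQNNet(nn.Module):
        def __init__(self, k=4, num_actions=4):
            super().__init__()
            self.conv1 = nn.Conv2d(k, 16, kernel_size=8, stride=4)   # Conv 1
            self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # Conv 2
            self.fc1 = nn.Linear(32 * 9 * 9, 256)                    # FC 1 (sized for 84x84 inputs)
            self.out = nn.Linear(256, num_actions)                   # one output node per action

        def forward(self, x):
            x = torch.relu(self.conv1(x))
            x = torch.relu(self.conv2(x))
            x = torch.relu(self.fc1(x.flatten(start_dim=1)))
            return self.out(x)   # Q-values for every discrete action at once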

  13. Q-function updates in DQN: Here’s the standard Q-learning update equation: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

  14. Q-function updates in DQN: Here’s the standard Q-learning update equation: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

  15. Q-function updates in DQN: Here’s the standard Q-learning update equation: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]. Rewriting: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]

  16. Q-function updates in DQN: Rewriting: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]. Let’s call r + γ max_a' Q(s', a') the “target.” This equation adjusts Q(s, a) in the direction of the target.

  17. Q-function updates in DQN: Rewriting: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]. Let’s call r + γ max_a' Q(s', a') the “target.” This equation adjusts Q(s, a) in the direction of the target. We’re going to accomplish this same thing in a different way using neural networks...
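As a reference point before moving to networks, here is the tabular update in its “target” form as a small sketch; the array layout Q[s, a] and the terminal-state handling via a done flag are illustrative details not spelled out on the slides.

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
        """Tabular Q-learning: move Q[s, a] a fraction alpha toward the target."""
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target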

  18. Q-function updates in DQN: Use this loss function: L(w) = ( r + γ max_a' Q(s', a'; w) − Q(s, a; w) )²

  19. Q-function updates in DQN: Use this loss function: L(w) = ( r + γ max_a' Q(s', a'; w) − Q(s, a; w) )². Notice that Q is now parameterized by the weights, w.

  20. Q-function updates in DQN: Use this loss function: L(w) = ( r + γ max_a' Q(s', a'; w) − Q(s, a; w) )². (I’m including the bias in the weights.)

  21. Q-function updates in DQN: Use this loss function: L(w) = ( r + γ max_a' Q(s', a'; w) − Q(s, a; w) )², where r + γ max_a' Q(s', a'; w) is the target.

  22. Question: Use this loss function, with the target term marked as above. What’s this called?

  23. Q-function updates in DQN: Use this loss function (target as above). We’re going to optimize this loss function using the following gradient: ∇_w L(w) = −( r + γ max_a' Q(s', a'; w) − Q(s, a; w) ) ∇_w Q(s, a; w) (constant factor absorbed into the step size)

  24. Think-pair-share: Use this loss function (target as above). We’re going to optimize this loss function using the following gradient: ∇_w L(w) = −( r + γ max_a' Q(s', a'; w) − Q(s, a; w) ) ∇_w Q(s, a; w). What’s wrong with this?

  25. Q-function updates in DQN: What’s wrong with this? The target itself depends on w, but this gradient treats it as a constant. We call this the semigradient rather than the gradient; semi-gradient descent still converges, and this is often more convenient.
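In practice, one common way to take a semigradient step is to compute the target with gradients disabled, so that backpropagation only differentiates through Q(s, a; w). A sketch in PyTorch, assuming batched tensors and an existing q_net and optimizer; all names are illustrative.

    import torch

    def semigradient_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
        # The target r + gamma * max_a' Q(s', a'; w) is computed under no_grad, so it
        # is treated as a constant: the gradient flows only through Q(s, a; w).
        with torch.no_grad():
            target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()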

  26. “Barebones” DQN:
      Initialize Q(s, a; w) with random weights
      Repeat (for each episode):
          Initialize s
          Repeat (for each step of the episode):
              Choose a from s using a policy derived from Q (e.g., ε-greedy)
              Take action a, observe r, s'
              Take one gradient-descent step on the loss, moving Q(s, a; w) toward the target
          Until s is terminal
      Where the target is r + γ max_a' Q(s', a'; w)

  27. “Barebones” DQN: [same pseudocode as slide 26] The gradient-descent step on w is all that changed relative to standard Q-learning.
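A compact sketch of the “barebones” loop on slides 26-27, reusing the semigradient_step helper sketched above. It assumes a Gym-style environment whose reset() returns just the state, whose step() returns the older (next state, reward, done, info) 4-tuple, and whose states are already PyTorch tensors; those details and the ε value are illustrative.

    import random
    import torch

    def barebones_dqn(env, q_net, optimizer, num_episodes=500, epsilon=0.1, gamma=0.99):
        """Barebones DQN: ε-greedy actions plus one semigradient step per transition."""
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < epsilon:
                    a = env.action_space.sample()          # explore
                else:
                    with torch.no_grad():                  # exploit: argmax_a Q(s, a; w)
                        a = q_net(s.unsqueeze(0)).argmax(dim=1).item()
                s_next, r, done, _ = env.step(a)
                # One semigradient step toward r + gamma * max_a' Q(s', a'; w)
                semigradient_step(q_net, optimizer,
                                  s.unsqueeze(0), torch.tensor([a]),
                                  torch.tensor([r], dtype=torch.float32),
                                  s_next.unsqueeze(0), torch.tensor([float(done)]),
                                  gamma)
                s = s_next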

  28. Example: 4x4 Frozen Lake env. Get to the goal (G); don’t fall in a hole (H). Demo!

  29. Think-pair-share: Suppose the “barebones” DQN algorithm with the DQN network above experiences the following transition: [transition shown on slide]. Which weights in the network could be updated on this iteration?

  30. Experience replay: Deep learning typically assumes independent, identically distributed (IID) training data.

  31. Experience replay: Deep learning typically assumes independent, identically distributed (IID) training data. But is this true in the deep RL scenario? [the “barebones” DQN loop is shown: it trains on consecutive transitions from the current episode, which are neither independent nor identically distributed]

  32. Experience replay: Deep learning typically assumes IID training data, but the deep RL loop does not produce it. Our solution: buffer experiences and then “replay” them during training.

  33. Experience replay:
      Initialize Q(s, a; w) with random weights; initialize replay buffer D
      Repeat (for each episode):
          Initialize s
          Repeat (for each step of the episode):
              Choose a from s using a policy derived from Q (e.g., ε-greedy)
              Take action a, observe r, s'
              Add this experience to buffer D
              If mod(step, trainfreq) == 0:   (i.e., train every trainfreq steps)
                  Sample batch B from D
                  Take one step of gradient descent on the loss over the sampled batch
          Until s is terminal

  34. Experience replay: [same pseudocode as slide 33] Where the target for each sampled transition (s, a, r, s') is r + γ max_a' Q(s', a'; w).

  35. Experience replay: [same pseudocode as slide 33] Buffers like this are pretty common in DL.
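A minimal replay-buffer sketch matching the role of D in the pseudocode above; the capacity and batch size are illustrative defaults.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity buffer D of (s, a, r, s', done) transitions."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # the oldest experiences fall off

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniformly random batch B from D, which breaks up temporal correlation.
            idx = random.sample(range(len(self.buffer)), batch_size)
            batch = [self.buffer[i] for i in idx]
            s, a, r, s_next, done = zip(*batch)
            return s, a, r, s_next, done

        def __len__(self):
            return len(self.buffer)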

  36. Think-pair-share: [the experience-replay pseudocode from slide 33] What do you think are the tradeoffs between: a large replay buffer vs. a small replay buffer? A large batch size vs. a small batch size?

  37. With target network:
      Initialize Q(s, a; w) with random weights; initialize the target network as a copy of Q; initialize replay buffer D
      Repeat (for each episode):
          Initialize s
          Repeat (for each step of the episode):
              Choose a from s using a policy derived from Q (e.g., ε-greedy)
              Take action a, observe r, s'; add this experience to D
              If mod(step, trainfreq) == 0:
                  Sample batch B from D; take one gradient-descent step on the loss over B
              If mod(step, copyfreq) == 0:
                  Copy the current weights w into the target network
          Until s is terminal
      Where the target is computed using the target network’s weights rather than w
      The target network helps stabilize deep learning; why?
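A sketch of the target-network bookkeeping described above: a frozen copy of the online network supplies the values inside the target, and its weights are overwritten with the online weights every copyfreq steps. It assumes q_net is the online network from the earlier sketches; copyfreq and the other names are illustrative.

    import copy
    import torch

    # A second, frozen copy of the online network provides the target values.
    target_net = copy.deepcopy(q_net)

    def compute_targets(r, s_next, done, gamma=0.99):
        # Targets come from the target network, not from the online network.
        with torch.no_grad():
            next_q = target_net(s_next).max(dim=1).values
        return r + gamma * next_q * (1 - done)

    def maybe_sync_target(step, copyfreq=1000):
        # Every copyfreq steps, copy the online weights into the target network.
        if step % copyfreq == 0:
            target_net.load_state_dict(q_net.state_dict())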

  38. Example: 4x4 Frozen Lake env. Get to the goal (G); don’t fall in a hole (H). Demo!

  39. Comparison: replay vs no replay (Avg final score achieved)

  40. Double DQN Recall the problem of maximization bias:

  41. Double DQN: Recall the problem of maximization bias. Our solution from the TD lecture [shown on slide]: can we adapt this to the DQN setting?

  42. Double DQN:
      [same loop as slide 37: ε-greedy action selection, store each (s, a, r, s') in D, sample a batch and take a gradient step every trainfreq steps, copy w into the target network every copyfreq steps]
      Where the target uses the online network to choose the action and the target network to evaluate it: r + γ Q(s', argmax_a' Q(s', a'; w); w_target)
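A sketch of the Double-DQN target computation, assuming the standard formulation in which the online network selects the argmax action in s' and the target network evaluates it; tensor shapes and names follow the earlier sketches.

    import torch

    def double_dqn_targets(q_net, target_net, r, s_next, done, gamma=0.99):
        with torch.no_grad():
            # The online network chooses the argmax action in s' ...
            best_a = q_net(s_next).argmax(dim=1, keepdim=True)
            # ... and the target network supplies that action's value.
            next_q = target_net(s_next).gather(1, best_a).squeeze(1)
        return r + gamma * next_q * (1 - done)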

  43. Think-pair-share: [the Double DQN pseudocode from slide 42]
      1. In what sense is this double Q-learning?
      2. What are the pros/cons vs. the earlier version of double-Q?
      3. Why not convert the original double-Q algorithm into a deep version?

  44. Double DQN

  45. Double DQN

  46. Prioritized replay buffer: [the experience-replay pseudocode from slide 33] Previously, the batch B was sampled from D uniformly at random. Can we do better by sampling the batch intelligently?
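A sketch of one simple way to sample the batch intelligently: proportional prioritized sampling, where each stored transition carries a priority (e.g., its last absolute TD error) and is drawn with probability proportional to it. The importance-sampling correction used in the prioritized-replay literature is omitted for brevity, and all names and defaults are illustrative.

    import numpy as np

    class PrioritizedReplayBuffer:
        """Samples transitions with probability proportional to priority (e.g., |TD error|)."""
        def __init__(self, capacity=100_000, eps=1e-3):
            self.capacity, self.eps = capacity, eps
            self.data, self.priorities = [], []

        def add(self, transition, priority=1.0):
            if len(self.data) >= self.capacity:   # drop the oldest transition
                self.data.pop(0)
                self.priorities.pop(0)
            self.data.append(transition)
            self.priorities.append(priority + self.eps)

        def sample(self, batch_size=32):
            p = np.asarray(self.priorities)
            p = p / p.sum()
            idx = np.random.choice(len(self.data), size=batch_size, p=p)
            return idx, [self.data[i] for i in idx]

        def update_priorities(self, idx, td_errors):
            # Revisit surprising transitions (large TD error) more often.
            for i, err in zip(idx, td_errors):
                self.priorities[i] = abs(err) + self.eps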

  47. Prioritized replay buffer: [chain environment shown on slide] the left action transitions to state 1 with zero reward; only the far-right state gives a reward of 1.

  48. Question: Why is the sampling method particularly important in this domain? [same chain environment: the left action transitions to state 1 with zero reward; only the far-right state gives a reward of 1]
