
Lecture 21: Reinforcement Learning. Justin Johnson, December 4, 2019.

Assignment 5: Object Detection (single-stage detector and two-stage detector) is due Monday 12/9 at 11:59pm.


Value Function and Q Function. Following a policy $\pi$ produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ How good is a state? The value function at state $s$ is the expected cumulative reward from following the policy starting from state $s$: $V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$. How good is a state-action pair? The Q function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy: $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$.
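To make these definitions concrete, here is a minimal Monte Carlo sketch (not lecture code) that estimates $V^{\pi}(s)$ by averaging discounted returns over sampled rollouts; `env_step` and `policy` are hypothetical stand-ins for an environment and a policy.

```python
# Toy illustration: estimate V^pi(s) by averaging discounted returns over
# sampled rollouts. env_step and policy are hypothetical stand-ins supplied
# by the caller; they are not part of the lecture.

def rollout_return(env_step, policy, s0, gamma=0.99, horizon=100):
    """Sample one trajectory from state s0 and return its discounted return."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                  # a ~ pi(a | s)
        s, r, done = env_step(s, a)    # environment returns next state, reward, done flag
        total += discount * r
        discount *= gamma
        if done:
            break
    return total

def estimate_value(env_step, policy, s0, num_rollouts=1000, gamma=0.99):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_t | s_0 = s0, pi]."""
    returns = [rollout_return(env_step, policy, s0, gamma) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)
```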

Bellman Equation. Optimal Q-function: $Q^*(s, a)$ is the Q-function for the optimal policy $\pi^*$. It gives the max possible future reward when taking action $a$ in state $s$: $Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$. $Q^*$ encodes the optimal policy: $\pi^*(s) = \arg\max_{a'} Q^*(s, a')$. Bellman Equation: $Q^*$ satisfies the recurrence relation $Q^*(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$, where $r \sim R(s, a)$ and $s' \sim P(s, a)$. Intuition: after taking action $a$ in state $s$ we receive reward $r$ and move to a new state $s'$; after that, the max possible reward we can get is $\max_{a'} Q^*(s', a')$.

Solving for the optimal policy: Value Iteration. Idea: if we find a function $Q(s, a)$ that satisfies the Bellman Equation, then it must be $Q^*$. Start with a random $Q$ and use the Bellman Equation as an update rule: $Q_{i+1}(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q_i(s', a')\right]$, where $r \sim R(s, a)$ and $s' \sim P(s, a)$. Amazing fact: $Q_i$ converges to $Q^*$ as $i \to \infty$. Problem: we need to keep track of $Q(s, a)$ for every (state, action) pair, which is impossible when the state space is infinite or very large. Solution: approximate $Q(s, a)$ with a neural network and use the Bellman Equation to define the loss!
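As a sketch of what this update looks like in the tabular setting, assuming a small, fully known MDP with transition tensor `P[s, a, s']` and expected-reward matrix `R[s, a]` (hypothetical inputs, not from the lecture):

```python
import numpy as np

# Minimal tabular Q-value-iteration sketch, assuming a small *known* MDP.
# P[s, a, s'] = transition probability, R[s, a] = expected reward; both are
# hypothetical arrays supplied by the caller. gamma is the discount factor.

def q_value_iteration(P, R, gamma=0.9, iters=1000, tol=1e-8):
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))      # start from an arbitrary Q
    for _ in range(iters):
        # Bellman backup: Q_{i+1}(s,a) = R(s,a) + gamma * E_{s'}[max_a' Q_i(s',a')]
        V = Q.max(axis=1)                        # V_i(s') = max_a' Q_i(s', a')
        Q_new = R + gamma * P.dot(V)             # expectation over next states s'
        if np.abs(Q_new - Q).max() < tol:        # (approximately) reached Q*
            return Q_new
        Q = Q_new
    return Q
```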

Solving for the optimal policy: Deep Q-Learning. Train a neural network with weights $\theta$ to approximate $Q^*$: $Q^*(s, a) \approx Q(s, a; \theta)$. Use the Bellman Equation to tell us what $Q$ should output for a given state and action: $y_{s,a,\theta} = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta)\right]$, where $r \sim R(s, a)$ and $s' \sim P(s, a)$. Use this target to define the loss for training $Q$: $L(s, a) = \left(Q(s, a; \theta) - y_{s,a,\theta}\right)^2$. Problem: this is nonstationary! The "target" for $Q(s, a)$ depends on the current weights $\theta$. Problem: how do we sample batches of data for training?
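A possible PyTorch sketch of this loss is below. The experience replay buffer and the frozen target network are the standard fixes for the two problems noted above; they are assumptions layered on top of the slide's equations, and `q_net`, `target_net`, and the minibatch tensors are hypothetical names.

```python
import torch
import torch.nn.functional as F

# Sketch of the deep Q-learning loss, assuming:
#   q_net, target_net : modules mapping a batch of states -> Q-values for all actions
#   (states, actions, rewards, next_states, dones) : a minibatch of transitions,
#   e.g. sampled from an experience replay buffer (fix for the batching problem);
#   target_net is a periodically-copied, frozen copy of q_net (fix for the
#   nonstationary-target problem).

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = r + gamma * max_a' Q(s', a'; theta_target), no bootstrap at terminal states
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones.float()) * max_next_q
    # L = (Q(s, a; theta) - y)^2, averaged over the batch
    return F.mse_loss(q_sa, target)
```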

Case Study: Playing Atari Games. Objective: complete the game with the highest score. State: raw pixel inputs of the game screen. Action: game controls, e.g. left, right, up, down. Reward: score increase/decrease at each time step. (Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013.)

Case Study: Playing Atari Games. Network input: the state $s_t$, a 4x84x84 stack of the last 4 frames (after RGB-to-grayscale conversion, downsampling, and cropping). Network architecture (weights $\theta$): Conv(4->16, 8x8, stride 4), Conv(16->32, 4x4, stride 2), FC-256, FC-A. Network output: Q-values for all actions; with 4 actions the last layer gives $Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)$, i.e. the network computes $Q(s, a; \theta)$. (Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013.)
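A sketch of the described network in PyTorch; the layer sizes follow the slide, and the flattened feature size (32*9*9) follows from the stated 4x84x84 input.

```python
import torch
import torch.nn as nn

# Sketch of the Q-network described on the slide (Mnih et al., 2013 setup):
# input is a 4x84x84 stack of preprocessed frames, output is one Q-value per action.
class AtariQNet(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: Q(s, a; theta) per action
        )

    def forward(self, x):
        return self.head(self.features(x))

# Usage sketch: q_values = AtariQNet()(torch.zeros(1, 4, 84, 84))  # shape (1, 4)
```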

Video: https://www.youtube.com/watch?v=V1eYniJ0Rnk

Q-Learning vs Policy Gradients. Q-Learning: train a network $Q_{\theta}(s, a)$ to estimate future rewards for every (state, action) pair. Problem: for some problems this can be a hard function to learn, and it can be easier to learn a direct mapping from states to actions. Policy Gradients: train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and gives a distribution over which action to take in that state. Objective function: the expected future reward when following policy $\pi_{\theta}$, $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[\sum_{t \ge 0} \gamma^t r_t\right]$. Find the optimal policy by maximizing: $\theta^* = \arg\max_{\theta} J(\theta)$ (use gradient ascent!).

Policy Gradients. Objective function: the expected future reward when following policy $\pi_{\theta}$, $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[\sum_{t \ge 0} \gamma^t r_t\right]$. Find the optimal policy by maximizing $\theta^* = \arg\max_{\theta} J(\theta)$ with gradient ascent. Problem: nondifferentiability! We don't know how to compute $\partial J / \partial \theta$. General formulation: $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right]$; we want to compute $\partial J / \partial \theta$.

Policy Gradients: REINFORCE Algorithm. General formulation: $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right]$; we want to compute $\partial J / \partial \theta$. Expanding the expectation,
$\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta} \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right] = \frac{\partial}{\partial \theta} \int_X p_{\theta}(x) f(x)\, dx = \int_X f(x) \frac{\partial}{\partial \theta} p_{\theta}(x)\, dx$.
The log-derivative trick, $\frac{\partial}{\partial \theta} \log p_{\theta}(x) = \frac{1}{p_{\theta}(x)} \frac{\partial}{\partial \theta} p_{\theta}(x) \Rightarrow \frac{\partial}{\partial \theta} p_{\theta}(x) = p_{\theta}(x) \frac{\partial}{\partial \theta} \log p_{\theta}(x)$, then gives
$\frac{\partial J}{\partial \theta} = \int_X f(x)\, p_{\theta}(x) \frac{\partial}{\partial \theta} \log p_{\theta}(x)\, dx = \mathbb{E}_{x \sim p_{\theta}}\left[f(x) \frac{\partial}{\partial \theta} \log p_{\theta}(x)\right]$.
Approximate the expectation via sampling!
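A tiny numeric check of this identity, using a Bernoulli distribution parameterized by a sigmoid (an illustrative assumption, not lecture code): the sampled score-function estimate should match the analytic gradient.

```python
import numpy as np

# Numeric check of dJ/dtheta = E_x[ f(x) * d/dtheta log p_theta(x) ] for a
# Bernoulli p_theta with p_theta(1) = sigmoid(theta) and an arbitrary f.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.3
f = {0: 1.0, 1: 4.0}                      # arbitrary "reward" for each outcome
p1 = sigmoid(theta)

# Analytic gradient: d/dtheta [ (1-p1) f(0) + p1 f(1) ] = p1 (1 - p1) (f(1) - f(0))
analytic = p1 * (1 - p1) * (f[1] - f[0])

# Sampled score-function estimate
xs = rng.random(100_000) < p1             # x ~ p_theta
# d/dtheta log p_theta(x) = (1 - p1) if x = 1 else -p1, for the sigmoid parameterization
score = np.where(xs, 1 - p1, -p1)
fx = np.where(xs, f[1], f[0])
sampled = np.mean(fx * score)

print(analytic, sampled)                  # the two values should be close
```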

Policy Gradients: REINFORCE Algorithm. Goal: train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and gives a distribution over which action to take in that state. Define: let $x = (s_0, a_0, s_1, a_1, \ldots)$ be the sequence of states and actions we get when following policy $\pi_{\theta}$. It's random: $x \sim p_{\theta}(x)$, with
$p_{\theta}(x) = \prod_{t \ge 0} P(s_{t+1} \mid s_t)\, \pi_{\theta}(a_t \mid s_t) \Rightarrow \log p_{\theta}(x) = \sum_{t \ge 0} \left[\log P(s_{t+1} \mid s_t) + \log \pi_{\theta}(a_t \mid s_t)\right]$.
The transition probabilities $P(s_{t+1} \mid s_t)$ belong to the environment; we can't compute them. The action probabilities $\pi_{\theta}(a_t \mid s_t)$ belong to the policy, which is exactly what we are learning. Differentiating with respect to $\theta$, the transition term drops out:
$\frac{\partial}{\partial \theta} \log p_{\theta}(x) = \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t \mid s_t)$.

Policy Gradients: REINFORCE Algorithm. Putting it together, the expected reward under $\pi_{\theta}$ is $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right]$, and its gradient is
$\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_{\theta}}\left[f(x) \frac{\partial}{\partial \theta} \log p_{\theta}(x)\right] = \mathbb{E}_{x \sim p_{\theta}}\left[f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t \mid s_t)\right]$.
Here $x \sim p_{\theta}$ is the sequence of states and actions we get when following policy $\pi_{\theta}$; $f(x)$ is the reward we get from the state sequence $x$; and $\frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t \mid s_t)$ is the gradient of the predicted action scores with respect to the model weights, computed by backpropagating through the model $\pi_{\theta}$.

Policy Gradients: REINFORCE Algorithm. 1. Initialize random weights $\theta$. 2. Collect trajectories $x$ and rewards $f(x)$ using policy $\pi_{\theta}$. 3. Compute $\partial J / \partial \theta = \mathbb{E}_{x \sim p_{\theta}}\left[f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t \mid s_t)\right]$, approximating the expectation with the sampled trajectories. 4. Take a gradient ascent step on $\theta$. 5. GOTO 2.

Policy Gradients: REINFORCE Algorithm. Intuition: when $f(x)$ is high, increase the probability of the actions we took; when $f(x)$ is low, decrease the probability of the actions we took.
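A minimal sketch of one REINFORCE update in PyTorch, assuming a `policy_net` that maps states to action logits and a hypothetical `run_episode` helper that collects per-step action log-probabilities and rewards by sampling from the policy:

```python
import torch

# Sketch of one REINFORCE update. policy_net, optimizer, and run_episode are
# assumed to be supplied by the caller; run_episode(policy_net) returns
# (log_probs, rewards), where log_probs is a list of scalar tensors
# log pi_theta(a_t | s_t) and rewards is a list of floats.

def reinforce_update(policy_net, optimizer, run_episode, gamma=0.99):
    log_probs, rewards = run_episode(policy_net)

    # f(x): total discounted reward of the sampled trajectory
    total_return = sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Surrogate loss whose gradient is -f(x) * sum_t d/dtheta log pi_theta(a_t | s_t)
    # (negated because optimizers minimize; we want gradient *ascent* on J)
    loss = -total_return * torch.stack(log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return total_return
```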

So far: Q-Learning and Policy Gradients. Q-Learning: train a network $Q_{\theta}(s, a)$ to estimate future rewards for every (state, action) pair; use the Bellman Equation to define the loss for training $Q$: $y_{s,a,\theta} = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta)\right]$ where $r \sim R(s, a)$, $s' \sim P(s, a)$, and $L(s, a) = \left(Q(s, a; \theta) - y_{s,a,\theta}\right)^2$. Policy Gradients: train a network $\pi_{\theta}(a \mid s)$ that takes the state as input and gives a distribution over which action to take in that state; use the REINFORCE rule for computing gradients: $J(\theta) = \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right]$, $\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_{\theta}}\left[f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_{\theta}(a_t \mid s_t)\right]$. Improving policy gradients: add a baseline to reduce the variance of the gradient estimator.
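A minimal sketch of the baseline idea, under the assumption that the baseline `b` is supplied by the caller (for example a running average of past returns); subtracting a state-independent constant leaves the estimator unbiased but can reduce its variance:

```python
import torch

# Sketch only: log_probs is a list of scalar tensors log pi_theta(a_t | s_t),
# total_return is f(x), and baseline is a caller-supplied constant b.
def reinforce_loss_with_baseline(log_probs, total_return, baseline):
    # Gradient of this loss is -(f(x) - b) * sum_t d/dtheta log pi_theta(a_t | s_t)
    return -(total_return - baseline) * torch.stack(log_probs).sum()
```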

Other approaches: Model-Based RL.
Actor-Critic: train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning). (Sutton and Barto, "Reinforcement Learning: An Introduction", 1998; Degris et al, "Model-free reinforcement learning with continuous action in practice", 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016.)
Model-Based: learn a model of the world's state transition function $P(s_{t+1} \mid s_t, a_t)$ and then use planning through the model to make decisions.
Imitation Learning: gather data about how experts perform in the environment and learn a function to imitate what they do (a supervised learning approach).
Inverse Reinforcement Learning: gather data of experts performing in the environment, learn a reward function that they seem to be optimizing, then use RL on that reward function. (Ng et al, "Algorithms for Inverse Reinforcement Learning", ICML 2000.)
Adversarial Learning: learn to fool a discriminator that classifies actions as real or fake. (Ho and Ermon, "Generative Adversarial Imitation Learning", NeurIPS 2016.)

Case Study: Playing Games.
AlphaGo (January 2016): used imitation learning + tree search + RL; beat 18-time world champion Lee Sedol. (Silver et al, "Mastering the game of Go with deep neural networks and tree search", Nature 2016.)
AlphaGo Zero (October 2017): simplified version of AlphaGo; no longer uses imitation learning; beat (at the time) #1-ranked Ke Jie. (Silver et al, "Mastering the game of Go without human knowledge", Nature 2017.)
AlphaZero (December 2018): generalized to other games: Chess and Shogi. (Silver et al, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play", Science 2018.)
MuZero (November 2019): plans through a learned model of the game. (Schrittwieser et al, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv 2019.)

Case Study: Playing Games. November 2019: Lee Sedol announces retirement. "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts." "Even if I become the number one, there is an entity that cannot be defeated." (Quotes from https://en.yna.co.kr/view/AEN20191127004800315)

More Complex Games. Dota 2: OpenAI Five (April 2019); no paper, only a blog post: https://openai.com/five/#how-openai-five-works. StarCraft II: AlphaStar (October 2019); Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019.

Reinforcement Learning: Interacting with the World. (Diagram: an Agent sends an Action to the Environment, and the Environment returns a Reward.) Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment.

Reinforcement Learning: Stochastic Computation Graphs. We can also use RL to train neural networks with nondifferentiable components! Example: a small "routing" network sends an image to one of K CNNs. Which network to use? The router outputs a distribution over the K networks, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7, and we sample from it (here: green).
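One way to train such a router is to treat the discrete choice as an action and apply REINFORCE, with the (negated, detached) task loss as the reward. The sketch below is an illustrative assumption, not the lecture's code; all module and helper names are made up.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Sketch: a router picks one of K expert CNNs per image. The hard,
# nondifferentiable routing choice is trained with REINFORCE; the experts
# receive ordinary gradients from the classification loss.
class RoutedClassifier(nn.Module):
    def __init__(self, router: nn.Module, experts: nn.ModuleList):
        super().__init__()
        self.router, self.experts = router, experts

    def forward(self, images, labels, criterion):
        route_dist = Categorical(logits=self.router(images))  # distribution over experts
        choice = route_dist.sample()                           # nondifferentiable sample
        # Run each image through its chosen expert (per-sample loop kept for clarity).
        logits = torch.stack([
            self.experts[c](img.unsqueeze(0)).squeeze(0)
            for c, img in zip(choice.tolist(), images)
        ])
        task_loss = criterion(logits, labels)    # gradients flow into the chosen experts
        reward = -task_loss.detach()             # f(x): higher reward = lower loss
        router_loss = -(reward * route_dist.log_prob(choice).mean())  # REINFORCE term
        return task_loss + router_loss
```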
