
Lecture 21: Reinforcement Learning
Justin Johnson, December 4, 2019

Assignment 5: Object Detection (single-stage detector and two-stage detector) is due Monday 12/9, 11:59pm.


Value Function and Q Function

Following a policy π produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy starting from state s:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
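These expectations can be estimated by sampling. Below is a minimal Monte Carlo sketch (my illustration, not from the slides) that approximates V^π(s_0) by rolling out a policy and averaging discounted returns; the `policy` and `env_step` functions are hypothetical placeholders for an agent and an environment.

```python
def discounted_return(rewards, gamma):
    # Sum of gamma^t * r_t over one sampled trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_value(env_step, policy, s0, gamma=0.9, num_episodes=1000, horizon=50):
    """Monte Carlo estimate of V^pi(s0): average the discounted return
    over trajectories generated by following `policy` from s0."""
    total = 0.0
    for _ in range(num_episodes):
        s, rewards = s0, []
        for _ in range(horizon):
            a = policy(s)                 # sample a_t ~ pi(a | s_t)
            s, r, done = env_step(s, a)   # environment returns s_{t+1}, r_t, done flag
            rewards.append(r)
            if done:
                break
        total += discounted_return(rewards, gamma)
    return total / num_episodes
```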

Bellman Equation

Optimal Q-function: Q*(s, a) is the Q-function for the optimal policy π*. It gives the max possible future reward when taking action a in state s:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

Q* encodes the optimal policy: π*(s) = arg max_{a'} Q*(s, a')

Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]    where r ~ R(s, a), s' ~ P(s, a)

Intuition: After taking action a in state s, we get reward r and move to a new state s'. After that, the max possible reward we can get is max_{a'} Q*(s', a').

Solving for the optimal policy: Value Iteration

Bellman Equation: Q* satisfies the recurrence
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]    where r ~ R(s, a), s' ~ P(s, a)

Idea: If we find a function Q(s, a) that satisfies the Bellman Equation, then it must be Q*. Start with a random Q, and use the Bellman Equation as an update rule:
Q_{i+1}(s, a) = E_{r,s'}[ r + γ max_{a'} Q_i(s', a') ]    where r ~ R(s, a), s' ~ P(s, a)

Amazing fact: Q_i converges to Q* as i → ∞

Problem: Need to keep track of Q(s, a) for all (state, action) pairs; impossible if they are infinite.

Solution: Approximate Q(s, a) with a neural network, and use the Bellman Equation as the loss!
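For a small finite MDP where the transition probabilities and rewards are known, this update can be run directly on a table. A minimal sketch (my illustration, not from the slides); the arrays P and R are hypothetical inputs describing the MDP.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iters=10_000):
    """Tabular Q value iteration for a finite MDP.
    P: [S, A, S] array of transition probabilities P(s' | s, a)
    R: [S, A] array of expected immediate rewards
    Returns Q (approximating Q*) and the greedy policy argmax_a Q(s, a)."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        # Bellman update: Q_{i+1}(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} Q_i(s', a') ]
        Q_next = R + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_next - Q).max() < tol:
            Q = Q_next
            break
        Q = Q_next
    return Q, Q.argmax(axis=1)
```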

Solving for the optimal policy: Deep Q-Learning

Train a neural network (with weights θ) to approximate Q*: Q*(s, a) ≈ Q(s, a; θ)

Use the Bellman Equation to tell us what Q should output for a given state and action:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]    where r ~ R(s, a), s' ~ P(s, a)

Use this to define the loss for training Q: L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²

Problem: Nonstationary! The "target" for Q(s, a) depends on the current weights θ!
Problem: How to sample batches of data for training?
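A minimal PyTorch sketch of this loss on a sampled batch (my illustration, not from the slides). Using a separate, periodically copied target network for the max term is one standard way to soften the nonstationarity problem noted above, and sampling batches from a replay buffer of past transitions is the usual answer to the batching problem; the tensor and module names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """batch: tensors s [B, ...], a [B], r [B], s_next [B, ...], done [B]."""
    s, a, r, s_next, done = batch
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target y = r + gamma * max_{a'} Q(s', a'; theta_target); no gradient flows through it
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done.float()) * q_next
    # Squared Bellman error, averaged over the batch
    return F.mse_loss(q_sa, y)
```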

Case Study: Playing Atari Games

Objective: Complete the game with the highest score
State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013

Case Study: Playing Atari Games

Network input: state s_t, a 4x84x84 stack of the last 4 frames (after RGB-to-grayscale conversion, downsampling, and cropping)

Network: a neural network Q(s, a; θ) with weights θ:
Conv(4->16, 8x8, stride 4) -> Conv(16->32, 4x4, stride 2) -> FC-256 -> FC-A (Q-values)

Network output: Q-values for all actions. With 4 actions, the last layer gives Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4).

Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
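A sketch of this architecture in PyTorch. The layer sizes follow the slide; the flattened size after the two conv layers (32 x 9 x 9 for 84x84 inputs) and the ReLU nonlinearities are my assumptions, not stated on the slide.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q(s, .; theta): maps a 4x84x84 frame stack to one Q-value per action."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # Conv(4->16, 8x8, stride 4)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # Conv(16->32, 4x4, stride 2)
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256 (9x9 spatial map for 84x84 input)
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, s):                                # s: [B, 4, 84, 84]
        return self.head(self.features(s))               # [B, num_actions]
```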

Video: https://www.youtube.com/watch?v=V1eYniJ0Rnk

Q-Learning vs Policy Gradients

Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair.
Problem: For some problems this can be a hard function to learn. For some problems it is easier to learn a mapping from states to actions.

Policy Gradients: Train a network π_θ(a | s) that takes a state as input and gives a distribution over which action to take in that state.

Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{x~p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)    (use gradient ascent!)

Policy Gradients

Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{x~p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)    (use gradient ascent!)

Problem: Nondifferentiability! We don't know how to compute ∂J/∂θ.

General formulation: J(θ) = E_{x~p_θ}[ f(x) ]; we want to compute ∂J/∂θ.

Policy Gradients: REINFORCE Algorithm

General formulation: J(θ) = E_{x~p_θ}[ f(x) ]; we want to compute ∂J/∂θ.

∂J/∂θ = ∂/∂θ E_{x~p_θ}[ f(x) ] = ∂/∂θ ∫_X p_θ(x) f(x) dx = ∫_X f(x) ∂/∂θ p_θ(x) dx

∂/∂θ log p_θ(x) = (1 / p_θ(x)) ∂/∂θ p_θ(x)   ⇒   ∂/∂θ p_θ(x) = p_θ(x) ∂/∂θ log p_θ(x)

∂J/∂θ = ∫_X f(x) p_θ(x) ∂/∂θ log p_θ(x) dx = E_{x~p_θ}[ f(x) ∂/∂θ log p_θ(x) ]

Approximate the expectation via sampling!
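A tiny numerical check of this identity (my illustration, not from the slides): for a Bernoulli distribution parameterized by a logit θ and an arbitrary f, the sample average of f(x) ∂/∂θ log p_θ(x) matches the analytic gradient of E[f(x)].

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                       # logit parameter of a Bernoulli distribution
p = 1.0 / (1.0 + np.exp(-theta))  # p_theta(x = 1)

def f(x):
    # Arbitrary "reward" of an outcome; here simply the outcome itself,
    # so J(theta) = E[f(x)] = p and dJ/dtheta = p * (1 - p)
    return float(x)

# Score-function estimator: average f(x) * d/dtheta log p_theta(x) over samples.
# For the logit parameterization, d/dtheta log p_theta(x) = x - p.
xs = rng.random(200_000) < p
grad_estimate = np.mean([f(x) * (float(x) - p) for x in xs])

print(grad_estimate, p * (1 - p))  # the two numbers should be close
```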

Policy Gradients: REINFORCE Algorithm

Goal: Train a network π_θ(a | s) that takes a state as input and gives a distribution over which action to take in that state.

Define: Let x = (s_0, a_0, s_1, a_1, …) be the sequence of states and actions we get when following policy π_θ. It is random: x ~ p_θ(x).

p_θ(x) = Π_{t≥0} P(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)
⇒ log p_θ(x) = Σ_{t≥0} [ log P(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]

The P(s_{t+1} | s_t, a_t) are transition probabilities of the environment; we can't compute them. The π_θ(a_t | s_t) are action probabilities of the policy; we are learning them!

The transition probabilities do not depend on θ, so they drop out when we differentiate:
∂/∂θ log p_θ(x) = Σ_{t≥0} ∂/∂θ log π_θ(a_t | s_t)

Policy Gradients: REINFORCE Algorithm

Goal: Train a network π_θ(a | s) that takes a state as input and gives a distribution over which action to take in that state.

Define: Let x = (s_0, a_0, s_1, a_1, …) be the sequence of states and actions we get when following policy π_θ. It is random: x ~ p_θ(x).

Expected reward under π_θ:
J(θ) = E_{x~p_θ}[ f(x) ]

∂J/∂θ = E_{x~p_θ}[ f(x) ∂/∂θ log p_θ(x) ] = E_{x~p_θ}[ f(x) Σ_{t≥0} ∂/∂θ log π_θ(a_t | s_t) ]

Here x is the sequence of states and actions we get when following policy π_θ, f(x) is the reward we get from that state sequence, and ∂/∂θ log π_θ(a_t | s_t) is the gradient of the predicted action scores with respect to the model weights: backprop through the model π_θ!

Policy Gradients: REINFORCE Algorithm

1. Initialize random weights θ
2. Collect trajectories x and rewards f(x) using policy π_θ
3. Compute ∂J/∂θ
4. Gradient ascent step on θ
5. GOTO 2

Intuition:
When f(x) is high: increase the probability of the actions we took.
When f(x) is low: decrease the probability of the actions we took.
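A minimal PyTorch sketch of one loop iteration (my illustration, not from the slides): sample a trajectory, weight each log π_θ(a_t | s_t) by the total trajectory reward f(x), and take a gradient ascent step. The `env` object (with reset() returning a state and step(a) returning the next state, reward, and a done flag) is a hypothetical environment interface.

```python
import torch

def reinforce_step(policy_net, optimizer, env, horizon=200):
    """One REINFORCE update from a single sampled trajectory.
    policy_net(s) is assumed to return action logits for a state tensor s."""
    log_probs, rewards = [], []
    s = env.reset()
    for _ in range(horizon):
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()                      # a_t ~ pi_theta(a | s_t)
        log_probs.append(dist.log_prob(a))     # log pi_theta(a_t | s_t)
        s, r, done = env.step(a.item())
        rewards.append(r)
        if done:
            break
    f_x = sum(rewards)                         # trajectory reward f(x)
    # dJ/dtheta is estimated by f(x) * sum_t d/dtheta log pi_theta(a_t | s_t);
    # gradient ascent on J is gradient descent on -J.
    loss = -f_x * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return f_x
```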

So far: Q-Learning and Policy Gradients

Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]    where r ~ R(s, a), s' ~ P(s, a)
L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²

Policy Gradients: Train a network π_θ(a | s) that takes a state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients:
J(θ) = E_{x~p_θ}[ f(x) ],    ∂J/∂θ = E_{x~p_θ}[ f(x) Σ_{t≥0} ∂/∂θ log π_θ(a_t | s_t) ]

Improving policy gradients: Add a baseline to reduce the variance of the gradient estimator.
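One simple baseline (my addition, not from the slides) is a running average of past trajectory rewards: weighting the log-probabilities by f(x) minus that average, rather than by f(x) itself, leaves the estimator unbiased as long as the baseline does not depend on the current trajectory, while typically reducing its variance. A sketch that plugs into the REINFORCE update above; `RunningBaseline` is a hypothetical helper.

```python
class RunningBaseline:
    """Exponential moving average of past trajectory rewards, used as a baseline b."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0

    def update(self, f_x):
        self.value = self.momentum * self.value + (1.0 - self.momentum) * f_x

# Inside reinforce_step, weight by the advantage (f(x) - b) instead of f(x):
#   advantage = f_x - baseline.value   # baseline built from *previous* trajectories
#   loss = -advantage * torch.stack(log_probs).sum()
#   baseline.update(f_x)
```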

Other Approaches: Model-Based RL

Actor-Critic: Train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning). [Sutton and Barto, "Reinforcement Learning: An Introduction", 1998; Degris et al, "Model-free reinforcement learning with continuous action in practice", 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016]

Model-Based: Learn a model of the world's state transition function P(s_{t+1} | s_t, a_t), then use planning through the model to make decisions.

Imitation Learning: Gather data about how experts perform in the environment, and learn a function to imitate what they do (a supervised learning approach).

Inverse Reinforcement Learning: Gather data of experts performing in the environment; learn a reward function that they seem to be optimizing, then use RL on that reward function. [Ng et al, "Algorithms for Inverse Reinforcement Learning", ICML 2000]

Adversarial Learning: Learn to fool a discriminator that classifies actions as real or fake. [Ho and Ermon, "Generative Adversarial Imitation Learning", NeurIPS 2016]

Case Study: Playing Games

AlphaGo (January 2016): Used imitation learning + tree search + RL; beat 18-time world champion Lee Sedol.
AlphaGo Zero (October 2017): Simplified version of AlphaGo; no longer using imitation learning; beat (at the time) #1 ranked Ke Jie.
AlphaZero (December 2018): Generalized to other games: Chess and Shogi.
MuZero (November 2019): Plans through a learned model of the game.

November 2019: Lee Sedol announces retirement: "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts." "Even if I become the number one, there is an entity that cannot be defeated." (Quotes from https://en.yna.co.kr/view/AEN20191127004800315)

Silver et al, "Mastering the game of Go with deep neural networks and tree search", Nature 2016
Silver et al, "Mastering the game of Go without human knowledge", Nature 2017
Silver et al, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play", Science 2018
Schrittwieser et al, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv 2019

More Complex Games

Dota 2: OpenAI Five (April 2019). No paper, only a blog post: https://openai.com/five/#how-openai-five-works

StarCraft II: AlphaStar (October 2019). Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019

Reinforcement Learning: Interacting With the World

[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward to the Agent.]

Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment.

Reinforcement Learning: Stochastic Computation Graphs

Can also use RL to train neural networks with nondifferentiable components!

Example: A small "routing" network sends an image to one of K CNNs. Which network to use? The router outputs a distribution over the K networks, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7, and we sample one of them (here: green).
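A minimal sketch of how such a router could be trained (my construction, not from the slides): the sampled choice of expert CNN is nondifferentiable, so the router receives a REINFORCE-style gradient using the negative per-example task loss as the reward f(x), while the chosen experts are trained with ordinary backprop. All module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedClassifier(nn.Module):
    """A small router samples one of K expert CNNs per image; the discrete choice
    is trained with REINFORCE, the experts with ordinary supervised backprop."""
    def __init__(self, router: nn.Module, experts: nn.ModuleList):
        super().__init__()
        self.router, self.experts = router, experts

    def forward(self, x):
        dist = torch.distributions.Categorical(logits=self.router(x))   # [B, K] logits
        k = dist.sample()                                                # nondifferentiable sample
        # Route each image through its sampled expert (per-sample loop kept for clarity)
        out = torch.stack([self.experts[int(k[i])](x[i:i+1])[0] for i in range(x.shape[0])])
        return out, dist.log_prob(k)                                     # [B, C], [B]

def routed_loss(out, log_prob, target):
    per_example = F.cross_entropy(out, target, reduction="none")         # [B]
    # Router: REINFORCE with reward -loss; a batch-mean baseline reduces variance.
    advantage = -(per_example.detach() - per_example.detach().mean())
    reinforce_term = -(advantage * log_prob).mean()
    # Experts: ordinary supervised loss on the routed predictions.
    return per_example.mean() + reinforce_term
```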
