Value Function and Q Function
Following a policy π produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, ...
How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
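To make the discounted sum Σ_{t≥0} γ^t r_t concrete, here is a minimal Python sketch (my own illustration, with a made-up reward sequence rather than anything from the lecture):

```python
# Discounted return for one sampled trajectory: sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    total, discount = 0.0, 1.0
    for r in rewards:           # r_0, r_1, r_2, ...
        total += discount * r   # add gamma^t * r_t
        discount *= gamma
    return total

# Hypothetical reward sequence: later rewards count for less.
print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))  # 0.81 + 3.28 ≈ 4.09
```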
Bellman Equation
Optimal Q-function: Q*(s, a) is the Q-function for the optimal policy π*. It gives the max possible future reward when taking action a in state s:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
Q* encodes the optimal policy: π*(s) = arg max_{a'} Q*(s, a')
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Intuition: After taking action a in state s, we get reward r and move to a new state s'. After that, the max possible reward we can get is max_{a'} Q*(s', a').
Solving for the optimal policy: Value Iteration
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Idea: If we find a function Q(s, a) that satisfies the Bellman Equation, then it must be Q*.
Start with a random Q, and use the Bellman Equation as an update rule:
Q_{i+1}(s, a) = E_{r,s'}[ r + γ max_{a'} Q_i(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Amazing fact: Q_i converges to Q* as i → ∞
Problem: We need to keep track of Q(s, a) for every (state, action) pair, which is impossible when the state space is infinite.
Solution: Approximate Q(s, a) with a neural network, and use the Bellman Equation as the loss!
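As a concrete (toy) illustration of this update rule, here is a minimal Python sketch of tabular Q-value iteration for a tiny, deterministic, made-up MDP; the arrays next_state and reward are hypothetical stand-ins for P(s, a) and R(s, a), so the expectation over r and s' is trivial.

```python
import numpy as np

# Hypothetical deterministic MDP with 2 states and 2 actions.
# next_state[s, a] and reward[s, a] stand in for P(s, a) and R(s, a).
next_state = np.array([[0, 1],
                       [1, 0]])
reward = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))                 # start from an arbitrary Q
for i in range(1000):                # Q_i converges toward Q*
    # Bellman update: Q_{i+1}(s, a) = r(s, a) + gamma * max_a' Q_i(s'(s, a), a')
    Q_new = reward + gamma * Q[next_state].max(axis=-1)
    if np.abs(Q_new - Q).max() < 1e-9:
        break
    Q = Q_new

print(Q)                             # approximates Q*(s, a)
```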
Solving for the optimal policy: Deep Q-Learning
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Train a neural network (with weights θ) to approximate Q*: Q*(s, a) ≈ Q(s, a; θ)
Use the Bellman Equation to tell us what Q should output for a given state and action:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]   where r ∼ R(s, a), s' ∼ P(s, a)
Use this to define the loss for training Q:   L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²
Problem: Nonstationary! The "target" for Q(s, a) depends on the current weights θ!
Problem: How do we sample batches of data for training?
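A minimal PyTorch sketch of this loss (my own illustration, not code from the lecture). It assumes a q_net module mapping a batch of states to per-action Q-values and a batch of (s, a, r, s', done) transitions; computing the target from a separate frozen target_net, as in the Mnih et al. paper cited on the next slide, is one standard way to address the nonstationarity problem, and drawing the batch from an experience-replay buffer is one standard answer to the batching problem.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target y = r + gamma * max_a' Q(s', a'); frozen weights, no gradient
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q

    # L(s, a) = (Q(s, a; theta) - y)^2, averaged over the batch
    return F.mse_loss(q_sa, y)
```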
Case Study: Playing Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
Case Study: Playing Atari Games
Network input: state s_t, a 4x84x84 stack of the last 4 frames (after RGB-to-grayscale conversion, downsampling, and cropping)
Network architecture (weights θ): Conv(4→16, 8x8, stride 4) → Conv(16→32, 4x4, stride 2) → FC-256 → FC-A (Q-values)
Network output: Q-values for all actions. With 4 actions, the last layer gives Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)
Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
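A sketch of this network in PyTorch, following the layer sizes listed on the slide; the class and variable names are my own, and the flattened size (32·9·9) simply follows from the stated kernel sizes and strides with no padding.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, states):                           # states: (B, 4, 84, 84)
        return self.layers(states)                       # (B, num_actions) Q-values

q_values = AtariQNetwork()(torch.zeros(1, 4, 84, 84))    # Q(s_t, a_1), ..., Q(s_t, a_4)
```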
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Q-Learning vs Policy Gradients
Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair.
Problem: For some problems this can be a hard function to learn; it can be easier to learn a direct mapping from states to actions.
Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{r ∼ p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)   (Use gradient ascent!)
Policy Gradients
Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{r ∼ p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)   (Use gradient ascent!)
Problem: Nondifferentiability! We don't know how to compute dJ/dθ.
General formulation: J(θ) = E_{x ∼ p_θ}[ f(x) ]; we want to compute dJ/dθ.
Policy Gradients: REINFORCE Algorithm
General formulation: J(θ) = E_{x ∼ p_θ}[ f(x) ]; we want to compute dJ/dθ.
dJ/dθ = (d/dθ) E_{x ∼ p_θ}[ f(x) ] = (d/dθ) ∫ p_θ(x) f(x) dx = ∫ f(x) (d/dθ) p_θ(x) dx
(d/dθ) p_θ(x) = p_θ(x) (1 / p_θ(x)) (d/dθ) p_θ(x) = p_θ(x) (d/dθ) log p_θ(x)
dJ/dθ = ∫ f(x) p_θ(x) (d/dθ) log p_θ(x) dx = E_{x ∼ p_θ}[ f(x) (d/dθ) log p_θ(x) ]
Approximate the expectation via sampling!
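This identity can be sanity-checked numerically on a toy distribution. Here is a small sketch (entirely my own example, not from the lecture) using a Bernoulli(θ) distribution and an arbitrary reward function f, comparing the sampled score-function estimate to the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                 # parameter of a Bernoulli distribution
# Arbitrary "reward": f(0) = 1, f(1) = 5, so
# J(theta) = theta*5 + (1-theta)*1 and the exact gradient is 5 - 1 = 4.
exact_grad = 5.0 - 1.0

# Score-function (REINFORCE) estimator: mean of f(x) * d/dtheta log p_theta(x)
x = rng.binomial(1, theta, size=200_000)
fx = np.where(x == 1, 5.0, 1.0)
score = x / theta - (1 - x) / (1 - theta)   # d/dtheta log p_theta(x) for a Bernoulli
estimate = np.mean(fx * score)

print(exact_grad, estimate)                 # estimate ≈ 4.0 up to sampling noise
```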
Policy Gradients: REINFORCE Algorithm
Goal: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Define: Let x = (s_0, a_0, s_1, a_1, ...) be the sequence of states and actions we get when following policy π_θ. It's random: x ∼ p_θ(x)
p_θ(x) = Π_{t≥0} P(s_{t+1} | s_t) π_θ(a_t | s_t)
⇒ log p_θ(x) = Σ_{t≥0} [ log P(s_{t+1} | s_t) + log π_θ(a_t | s_t) ]
The transition probabilities P(s_{t+1} | s_t) belong to the environment: we can't compute them. The action probabilities π_θ(a_t | s_t) belong to the policy: we are learning them!
The transition term does not depend on θ, so it drops out when we differentiate:
(d/dθ) log p_θ(x) = Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t)
Policy Gradients: REINFORCE Algorithm
Goal: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Define: Let x = (s_0, a_0, s_1, a_1, ...) be the sequence of states and actions we get when following policy π_θ. It's random: x ∼ p_θ(x)
Expected reward under π_θ:   J(θ) = E_{x ∼ p_θ}[ f(x) ]
dJ/dθ = E_{x ∼ p_θ}[ f(x) (d/dθ) log p_θ(x) ] = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Here f(x) is the reward we get from the state sequence x, and (d/dθ) log π_θ(a_t | s_t) is the gradient of the predicted action scores with respect to the model weights: backprop through the model π_θ!
Algorithm:
1. Initialize random weights θ
2. Collect trajectories x and rewards f(x) using policy π_θ
3. Compute dJ/dθ
4. Gradient ascent step on θ
5. GOTO 2
Policy Gradients: REINFORCE Algorithm
dJ/dθ = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Intuition:
When f(x) is high: increase the probability of the actions we took.
When f(x) is low: decrease the probability of the actions we took.
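A minimal PyTorch sketch of one REINFORCE iteration (my own illustration, not lecture code). The policy_net module, the env interface returning (state, reward, done), the single-trajectory batch, and treating the undiscounted total reward as f(x) are all simplifying assumptions.

```python
import torch
from torch.distributions import Categorical

def reinforce_step(policy_net, optimizer, env, max_steps=1000):
    # 2. Collect one trajectory x and its reward f(x) using policy pi_theta
    state = env.reset()
    log_probs, total_reward = [], 0.0
    for _ in range(max_steps):
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()                   # a_t ~ pi_theta(a | s_t)
        log_probs.append(dist.log_prob(action))  # log pi_theta(a_t | s_t)
        state, reward, done = env.step(action.item())
        total_reward += reward                   # accumulates f(x)
        if done:
            break

    # 3./4. dJ/dtheta ~ f(x) * sum_t d/dtheta log pi_theta(a_t | s_t);
    # minimizing the negative of this surrogate is a gradient ascent step on J.
    loss = -total_reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return total_reward
```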
So far: Q-Learning and Policy Gradients
Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]   where r ∼ R(s, a), s' ∼ P(s, a)
L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²
Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients:
J(θ) = E_{x ∼ p_θ}[ f(x) ],   dJ/dθ = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Improving policy gradients: Add a baseline to reduce the variance of the gradient estimator.
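The baseline idea replaces f(x) with f(x) − b in the REINFORCE estimator, where b must not depend on the sampled actions. As a hedged sketch (my own choice of baseline, one of several reasonable ones), a running average of recent trajectory rewards could be used like this:

```python
import torch

def reinforce_loss_with_baseline(total_reward, log_probs, baseline, beta=0.9):
    # Weight log-probabilities by (f(x) - b) instead of f(x); the estimator
    # stays unbiased because the baseline does not depend on the actions.
    advantage = total_reward - baseline
    loss = -advantage * torch.stack(log_probs).sum()
    new_baseline = beta * baseline + (1 - beta) * total_reward  # running average
    return loss, new_baseline
```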
Other Approaches
Actor-Critic: Train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning).
Sutton and Barto, "Reinforcement Learning: An Introduction", 1998; Degris et al, "Model-free reinforcement learning with continuous action in practice", 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016
Model-Based: Learn a model of the world's state transition function P(s_{t+1} | s_t, a_t), then use planning through the model to make decisions.
Imitation Learning: Gather data about how experts perform in the environment; learn a function to imitate what they do (a supervised learning approach).
Inverse Reinforcement Learning: Gather data of experts performing in the environment; learn a reward function that they seem to be optimizing, then use RL on that reward function.
Ng et al, "Algorithms for Inverse Reinforcement Learning", ICML 2000
Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake.
Ho and Ermon, "Generative Adversarial Imitation Learning", NeurIPS 2016
Case Study: Playing Games
AlphaGo (January 2016)
- Used imitation learning + tree search + RL
- Beat 18-time world champion Lee Sedol
AlphaGo Zero (October 2017)
- Simplified version of AlphaGo
- No longer using imitation learning
- Beat (at the time) #1 ranked Ke Jie
AlphaZero (December 2018)
- Generalized to other games: Chess and Shogi
MuZero (November 2019)
- Plans through a learned model of the game
November 2019: Lee Sedol announces retirement. "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts." "Even if I become the number one, there is an entity that cannot be defeated."
Quotes from: https://en.yna.co.kr/view/AEN20191127004800315
Silver et al, "Mastering the game of Go with deep neural networks and tree search", Nature 2016
Silver et al, "Mastering the game of Go without human knowledge", Nature 2017
Silver et al, "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play", Science 2018
Schrittwieser et al, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv 2019
More Complex Games
StarCraft II: AlphaStar (October 2019)
Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019
Dota 2: OpenAI Five (April 2019)
No paper, only a blog post: https://openai.com/five/#how-openai-five-works
Reinforcement Learning: Interacting With the World
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward to the Agent]
Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment.
Reinforcement Learning: Stochastic Computation Graphs
We can also use RL to train neural networks with nondifferentiable components!
Example: a small "routing" network sends each image to one of K CNNs.
Which network to use? The routing network outputs a distribution over the K networks, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7, and then samples one of them (here: green).
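One way this could be wired up is sketched below (my own illustration, an assumption rather than a recipe from the lecture): the sampled routing decision is nondifferentiable, so the router is updated with the REINFORCE estimator from earlier in the lecture, using the negative per-example task loss as its reward, while the K expert CNNs train by ordinary backprop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class RoutedClassifier(nn.Module):
    def __init__(self, router, experts):
        super().__init__()
        self.router = router                    # small network: image -> K logits
        self.experts = nn.ModuleList(experts)   # the K CNNs

    def forward(self, images, labels):
        dist = Categorical(logits=self.router(images))   # e.g. P(orange), P(blue), P(green)
        choice = dist.sample()                            # nondifferentiable routing decision

        # Run each image through its sampled expert (per-sample loop kept simple for clarity)
        logits = torch.stack([self.experts[int(k)](img.unsqueeze(0)).squeeze(0)
                              for img, k in zip(images, choice)])
        task_loss = F.cross_entropy(logits, labels, reduction="none")

        # Experts learn by ordinary backprop; the router learns via REINFORCE,
        # with reward f(x) = -task_loss for the routing action it sampled.
        reward = -task_loss.detach()
        router_loss = -(reward * dist.log_prob(choice)).mean()
        return task_loss.mean() + router_loss
```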