Value Function and Q Function
Following a policy π produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, ...
How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
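To make the discounted sum Σ_{t≥0} γ^t r_t concrete, here is a minimal Python sketch (my own illustration, with a made-up reward sequence rather than anything from the lecture):

```python
# Discounted return for one sampled trajectory: sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=0.99):
    total, discount = 0.0, 1.0
    for r in rewards:           # r_0, r_1, r_2, ...
        total += discount * r   # add gamma^t * r_t
        discount *= gamma
    return total

# Hypothetical reward sequence: later rewards count for less.
print(discounted_return([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))  # 0.81 + 3.28 ≈ 4.09
```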
Bellman Equation
Optimal Q-function: Q*(s, a) is the Q-function for the optimal policy π*. It gives the max possible future reward when taking action a in state s:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
Q* encodes the optimal policy: π*(s) = arg max_{a'} Q*(s, a')
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Intuition: After taking action a in state s, we get reward r and move to a new state s'. After that, the max possible reward we can get is max_{a'} Q*(s', a').
Solving for the optimal policy: Value Iteration
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Idea: If we find a function Q(s, a) that satisfies the Bellman Equation, then it must be Q*.
Start with a random Q, and use the Bellman Equation as an update rule:
Q_{i+1}(s, a) = E_{r,s'}[ r + γ max_{a'} Q_i(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Amazing fact: Q_i converges to Q* as i → ∞
Problem: We need to keep track of Q(s, a) for every (state, action) pair, which is impossible when the state space is infinite.
Solution: Approximate Q(s, a) with a neural network, and use the Bellman Equation as the loss!
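As a concrete (toy) illustration of this update rule, here is a minimal Python sketch of tabular Q-value iteration for a tiny, deterministic, made-up MDP; the arrays next_state and reward are hypothetical stand-ins for P(s, a) and R(s, a), so the expectation over r and s' is trivial.

```python
import numpy as np

# Hypothetical deterministic MDP with 2 states and 2 actions.
# next_state[s, a] and reward[s, a] stand in for P(s, a) and R(s, a).
next_state = np.array([[0, 1],
                       [1, 0]])
reward = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))                 # start from an arbitrary Q
for i in range(1000):                # Q_i converges toward Q*
    # Bellman update: Q_{i+1}(s, a) = r(s, a) + gamma * max_a' Q_i(s'(s, a), a')
    Q_new = reward + gamma * Q[next_state].max(axis=-1)
    if np.abs(Q_new - Q).max() < 1e-9:
        break
    Q = Q_new

print(Q)                             # approximates Q*(s, a)
```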
Solving for the optimal policy: Deep Q-Learning
Bellman Equation: Q* satisfies the following recurrence relation:
Q*(s, a) = E_{r,s'}[ r + γ max_{a'} Q*(s', a') ]   where r ∼ R(s, a), s' ∼ P(s, a)
Train a neural network (with weights θ) to approximate Q*: Q*(s, a) ≈ Q(s, a; θ)
Use the Bellman Equation to tell us what Q should output for a given state and action:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]   where r ∼ R(s, a), s' ∼ P(s, a)
Use this to define the loss for training Q:   L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²
Problem: Nonstationary! The "target" for Q(s, a) depends on the current weights θ!
Problem: How do we sample batches of data for training?
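A minimal PyTorch sketch of this loss (my own illustration, not code from the lecture). It assumes a q_net module mapping a batch of states to per-action Q-values and a batch of (s, a, r, s', done) transitions; computing the target from a separate frozen target_net, as in the Mnih et al. paper cited on the next slide, is one standard way to address the nonstationarity problem, and drawing the batch from an experience-replay buffer is one standard answer to the batching problem.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target y = r + gamma * max_a' Q(s', a'); frozen weights, no gradient
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q

    # L(s, a) = (Q(s, a; theta) - y)^2, averaged over the batch
    return F.mse_loss(q_sa, y)
```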
Case Study: Playing Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
Case Study: Playing Atari Games
Network input: state s_t, a 4x84x84 stack of the last 4 frames (after RGB-to-grayscale conversion, downsampling, and cropping)
Network architecture (weights θ): Conv(4→16, 8x8, stride 4) → Conv(16→32, 4x4, stride 2) → FC-256 → FC-A (Q-values)
Network output: Q-values for all actions. With 4 actions, the last layer gives Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)
Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
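A sketch of this network in PyTorch, following the layer sizes listed on the slide; the class and variable names are my own, and the flattened size (32·9·9) simply follows from the stated kernel sizes and strides with no padding.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, states):                           # states: (B, 4, 84, 84)
        return self.layers(states)                       # (B, num_actions) Q-values

q_values = AtariQNetwork()(torch.zeros(1, 4, 84, 84))    # Q(s_t, a_1), ..., Q(s_t, a_4)
```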
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Q-Learning vs Policy Gradients
Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair.
Problem: For some problems this can be a hard function to learn; it can be easier to learn a direct mapping from states to actions.
Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{r ∼ p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)   (Use gradient ascent!)
Policy Gradients
Objective function: Expected future rewards when following policy π_θ:
J(θ) = E_{r ∼ p_θ}[ Σ_{t≥0} γ^t r_t ]
Find the optimal policy by maximizing: θ* = arg max_θ J(θ)   (Use gradient ascent!)
Problem: Nondifferentiability! We don't know how to compute dJ/dθ.
General formulation: J(θ) = E_{x ∼ p_θ}[ f(x) ]; we want to compute dJ/dθ.
Policy Gradients: REINFORCE Algorithm
General formulation: J(θ) = E_{x ∼ p_θ}[ f(x) ]; we want to compute dJ/dθ.
dJ/dθ = (d/dθ) E_{x ∼ p_θ}[ f(x) ] = (d/dθ) ∫ p_θ(x) f(x) dx = ∫ f(x) (d/dθ) p_θ(x) dx
(d/dθ) p_θ(x) = p_θ(x) (1 / p_θ(x)) (d/dθ) p_θ(x) = p_θ(x) (d/dθ) log p_θ(x)
dJ/dθ = ∫ f(x) p_θ(x) (d/dθ) log p_θ(x) dx = E_{x ∼ p_θ}[ f(x) (d/dθ) log p_θ(x) ]
Approximate the expectation via sampling!
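This identity can be sanity-checked numerically on a toy distribution. Here is a small sketch (entirely my own example, not from the lecture) using a Bernoulli(θ) distribution and an arbitrary reward function f, comparing the sampled score-function estimate to the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                 # parameter of a Bernoulli distribution
# Arbitrary "reward": f(0) = 1, f(1) = 5, so
# J(theta) = theta*5 + (1-theta)*1 and the exact gradient is 5 - 1 = 4.
exact_grad = 5.0 - 1.0

# Score-function (REINFORCE) estimator: mean of f(x) * d/dtheta log p_theta(x)
x = rng.binomial(1, theta, size=200_000)
fx = np.where(x == 1, 5.0, 1.0)
score = x / theta - (1 - x) / (1 - theta)   # d/dtheta log p_theta(x) for a Bernoulli
estimate = np.mean(fx * score)

print(exact_grad, estimate)                 # estimate ≈ 4.0 up to sampling noise
```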
Policy Gradients: REINFORCE Algorithm
Goal: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Define: Let x = (s_0, a_0, s_1, a_1, ...) be the sequence of states and actions we get when following policy π_θ. It's random: x ∼ p_θ(x)
p_θ(x) = Π_{t≥0} P(s_{t+1} | s_t) π_θ(a_t | s_t)
⇒ log p_θ(x) = Σ_{t≥0} [ log P(s_{t+1} | s_t) + log π_θ(a_t | s_t) ]
The transition probabilities P(s_{t+1} | s_t) belong to the environment: we can't compute them. The action probabilities π_θ(a_t | s_t) belong to the policy: we are learning them!
The transition term does not depend on θ, so it drops out when we differentiate:
(d/dθ) log p_θ(x) = Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t)
Policy Gradients: REINFORCE Algorithm
Goal: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.
Define: Let x = (s_0, a_0, s_1, a_1, ...) be the sequence of states and actions we get when following policy π_θ. It's random: x ∼ p_θ(x)
Expected reward under π_θ:   J(θ) = E_{x ∼ p_θ}[ f(x) ]
dJ/dθ = E_{x ∼ p_θ}[ f(x) (d/dθ) log p_θ(x) ] = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Here f(x) is the reward we get from the state sequence x, and (d/dθ) log π_θ(a_t | s_t) is the gradient of the predicted action scores with respect to the model weights: backprop through the model π_θ!
Algorithm:
1. Initialize random weights θ
2. Collect trajectories x and rewards f(x) using policy π_θ
3. Compute dJ/dθ
4. Gradient ascent step on θ
5. GOTO 2
Policy Gradients: REINFORCE Algorithm
dJ/dθ = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Intuition:
When f(x) is high: increase the probability of the actions we took.
When f(x) is low: decrease the probability of the actions we took.
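A minimal PyTorch sketch of one REINFORCE iteration (my own illustration, not lecture code). The policy_net module, the env interface returning (state, reward, done), the single-trajectory batch, and treating the undiscounted total reward as f(x) are all simplifying assumptions.

```python
import torch
from torch.distributions import Categorical

def reinforce_step(policy_net, optimizer, env, max_steps=1000):
    # 2. Collect one trajectory x and its reward f(x) using policy pi_theta
    state = env.reset()
    log_probs, total_reward = [], 0.0
    for _ in range(max_steps):
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()                   # a_t ~ pi_theta(a | s_t)
        log_probs.append(dist.log_prob(action))  # log pi_theta(a_t | s_t)
        state, reward, done = env.step(action.item())
        total_reward += reward                   # accumulates f(x)
        if done:
            break

    # 3./4. dJ/dtheta ~ f(x) * sum_t d/dtheta log pi_theta(a_t | s_t);
    # minimizing the negative of this surrogate is a gradient ascent step on J.
    loss = -total_reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return total_reward
```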
So far: Q-Learning and Policy Gradients
Q-Learning: Train a network Q_θ(s, a) to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q:
y_{s,a,θ} = E_{r,s'}[ r + γ max_{a'} Q(s', a'; θ) ]   where r ∼ R(s, a), s' ∼ P(s, a)
L(s, a) = (Q(s, a; θ) − y_{s,a,θ})²
Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients:
J(θ) = E_{x ∼ p_θ}[ f(x) ],   dJ/dθ = E_{x ∼ p_θ}[ f(x) Σ_{t≥0} (d/dθ) log π_θ(a_t | s_t) ]
Improving policy gradients: Add a baseline to reduce the variance of the gradient estimator.
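The baseline idea replaces f(x) with f(x) − b in the REINFORCE estimator, where b must not depend on the sampled actions. As a hedged sketch (my own choice of baseline, one of several reasonable ones), a running average of recent trajectory rewards could be used like this:

```python
import torch

def reinforce_loss_with_baseline(total_reward, log_probs, baseline, beta=0.9):
    # Weight log-probabilities by (f(x) - b) instead of f(x); the estimator
    # stays unbiased because the baseline does not depend on the actions.
    advantage = total_reward - baseline
    loss = -advantage * torch.stack(log_probs).sum()
    new_baseline = beta * baseline + (1 - beta) * total_reward  # running average
    return loss, new_baseline
```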
Other Approaches
Actor-Critic: Train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning).
Sutton and Barto, "Reinforcement Learning: An Introduction", 1998; Degris et al, "Model-free reinforcement learning with continuous action in practice", 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016
Model-Based: Learn a model of the world's state transition function P(s_{t+1} | s_t, a_t), then use planning through the model to make decisions.
Imitation Learning: Gather data about how experts perform in the environment; learn a function to imitate what they do (a supervised learning approach).
Inverse Reinforcement Learning: Gather data of experts performing in the environment; learn a reward function that they seem to be optimizing, then use RL on that reward function.
Ng et al, "Algorithms for Inverse Reinforcement Learning", ICML 2000
Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake.
Ho and Ermon, "Generative Adversarial Imitation Learning", NeurIPS 2016
Case Study: Playing Games
AlphaGo (January 2016)
- Used imitation learning + tree search + RL
- Beat 18-time world champion Lee Sedol
AlphaGo Zero (October 2017)
- Simplified version of AlphaGo
- No longer using imitation learning
- Beat (at the time) #1 ranked Ke Jie
AlphaZero (December 2018)
- Generalized to other games: Chess and Shogi
MuZero (November 2019)
- Plans through a learned model of the game
November 2019: Lee Sedol announces retirement. "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts." "Even if I become the number one, there is an entity that cannot be defeated."
Quotes from: https://en.yna.co.kr/view/AEN20191127004800315
Silver et al, "Mastering the game of Go with deep neural networks and tree search", Nature 2016
Silver et al, "Mastering the game of Go without human knowledge", Nature 2017
Silver et al, "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play", Science 2018
Schrittwieser et al, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv 2019
More Complex Games
StarCraft II: AlphaStar (October 2019)
Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019
Dota 2: OpenAI Five (April 2019)
No paper, only a blog post: https://openai.com/five/#how-openai-five-works
Reinforcement Learning: Interacting With the World
[Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward to the Agent]
Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment.
Reinforcement Learning: Stochastic Computation Graphs
We can also use RL to train neural networks with nondifferentiable components!
Example: a small "routing" network sends each image to one of K CNNs.
Which network to use? The routing network outputs a distribution over the K networks, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7, and then samples one of them (here: green).
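One way this could be wired up is sketched below (my own illustration, an assumption rather than a recipe from the lecture): the sampled routing decision is nondifferentiable, so the router is updated with the REINFORCE estimator from earlier in the lecture, using the negative per-example task loss as its reward, while the K expert CNNs train by ordinary backprop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class RoutedClassifier(nn.Module):
    def __init__(self, router, experts):
        super().__init__()
        self.router = router                    # small network: image -> K logits
        self.experts = nn.ModuleList(experts)   # the K CNNs

    def forward(self, images, labels):
        dist = Categorical(logits=self.router(images))   # e.g. P(orange), P(blue), P(green)
        choice = dist.sample()                            # nondifferentiable routing decision

        # Run each image through its sampled expert (per-sample loop kept simple for clarity)
        logits = torch.stack([self.experts[int(k)](img.unsqueeze(0)).squeeze(0)
                              for img, k in zip(images, choice)])
        task_loss = F.cross_entropy(logits, labels, reduction="none")

        # Experts learn by ordinary backprop; the router learns via REINFORCE,
        # with reward f(x) = -task_loss for the routing action it sampled.
        reward = -task_loss.detach()
        router_loss = -(reward * dist.log_prob(choice)).mean()
        return task_loss.mean() + router_loss
```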