Likelihood Ratio Policy Gradient
[Aleksandrov, Sysoyev & Shemeneva, 1968] [Rubinstein, 1969] [Glynn, 1986] [REINFORCE: Williams, 1992] [GPOMDP: Baxter & Bartlett, 2001]
Likelihood Ratio Gradient: Validity

$$\nabla U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)R\!\left(\tau^{(i)}\right)$$

Valid even when:
- R is discontinuous and/or unknown
- The sample space (of paths) is a discrete set
Likelihood Ratio Gradient: Intuition

$$\nabla U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)R\!\left(\tau^{(i)}\right)$$

The gradient tries to:
- Increase the probability of paths with positive R
- Decrease the probability of paths with negative R

→ The likelihood ratio gradient changes the probabilities of experienced paths; it does not try to change the paths themselves (in contrast with the path derivative).
Let’s Decompose Path into States and Actions
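The equations on this slide did not survive extraction; the following is a reconstruction of the standard decomposition, consistent with the trajectory distribution recalled later in the KL-evaluation slides. The dynamics terms do not depend on θ, so they drop out of the gradient of the log-probability:

$$P\!\left(\tau^{(i)};\theta\right) = P\!\left(s_0^{(i)}\right)\prod_{t=0}^{H-1}\pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)P\!\left(s_{t+1}^{(i)}\mid s_t^{(i)},u_t^{(i)}\right)$$

$$\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right) = \sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)$$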
Likelihood Ratio Gradient Estimate
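Plugging the decomposition into the likelihood ratio estimate above gives the form used in the rest of the lecture (again a reconstruction of the slide's missing equation):

$$\nabla U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\left(\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\right)R\!\left(\tau^{(i)}\right)$$

Note that no model of the dynamics is needed: only the policy's log-probabilities and the observed returns.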
Outline for Today's Lecture
- Super-quick Refresher: Markov Decision Processes (MDPs); Reinforcement Learning; Policy Optimization
- Model-free Policy Optimization: Finite Differences
- Model-free Policy Optimization: Cross-Entropy Method
- Model-free Policy Optimization: Policy Gradients
  - Policy Gradient standard derivation
  - Temporal decomposition
  - Policy Gradient importance sampling derivation
  - Baseline subtraction
  - Value function estimation
  - Advantage Estimation (A2C/A3C/GAE)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
Derivation from Importance Sampling

$$U(\theta) = E_{\tau\sim\theta_{old}}\!\left[\frac{P(\tau\mid\theta)}{P(\tau\mid\theta_{old})}\,R(\tau)\right]$$

$$\nabla_\theta U(\theta) = E_{\tau\sim\theta_{old}}\!\left[\frac{\nabla_\theta P(\tau\mid\theta)}{P(\tau\mid\theta_{old})}\,R(\tau)\right]$$

$$\nabla_\theta U(\theta)\Big|_{\theta=\theta_{old}} = E_{\tau\sim\theta_{old}}\!\left[\frac{\nabla_\theta P(\tau\mid\theta)\big|_{\theta_{old}}}{P(\tau\mid\theta_{old})}\,R(\tau)\right] = E_{\tau\sim\theta_{old}}\!\left[\nabla_\theta \log P(\tau\mid\theta)\big|_{\theta_{old}}\,R(\tau)\right]$$

This suggests we can also look at more than just the gradient: e.g., the importance-sampled objective itself can be used as a (local) "surrogate loss" [→ later: PPO]. [Tang & Abbeel, NeurIPS 2011]
Likelihood Ratio Gradient Estimate
- As formulated thus far: unbiased but very noisy
- Fixes that lead to real-world practicality:
  - Baseline
  - Temporal structure
  - [later] Trust region / natural gradient
Likelihood Ratio Gradient: Baseline

$$\nabla U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)R\!\left(\tau^{(i)}\right)$$

→ Consider a baseline b:

$$\nabla U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)\left(R\!\left(\tau^{(i)}\right) - b\right)$$

Still unbiased [Williams, 1992], as long as the baseline does not depend on the action inside logprob(action):

$$E\!\left[\nabla_\theta \log P(\tau;\theta)\,b\right] = \sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,b = \sum_\tau P(\tau;\theta)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\,b = \sum_\tau \nabla_\theta P(\tau;\theta)\,b = b\,\nabla_\theta\!\left(\sum_\tau P(\tau;\theta)\right) = b\,\nabla_\theta(1) = 0$$
Likelihood Ratio and Temporal Structure

Current estimate:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)\left(R\!\left(\tau^{(i)}\right) - b\right)$$

$$= \frac{1}{m}\sum_{i=1}^{m}\left(\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\right)\left(\sum_{t=0}^{H-1}R\!\left(s_t^{(i)},u_t^{(i)}\right) - b\right)$$

$$= \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\left[\sum_{k=0}^{t-1}R\!\left(s_k^{(i)},u_k^{(i)}\right) + \sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right) - b\right]$$

The first sum inside the brackets does not depend on $u_t^{(i)}$; the baseline may depend on $s_t^{(i)}$. Removing terms that don't depend on the current action can lower variance:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\left(\sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right) - b\!\left(s_t^{(i)}\right)\right)$$

[Policy Gradient Theorem: Sutton et al., NIPS 1999; GPOMDP: Baxter & Bartlett, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
Baseline Choices

Good choices for b?
- Constant baseline: $b = E[R(\tau)] \approx \frac{1}{m}\sum_{i=1}^{m}R\!\left(\tau^{(i)}\right)$
- Optimal constant baseline
- Time-dependent baseline: $b_t = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right)$
- State-dependent expected return: $b(s_t) = E\!\left[r_t + r_{t+1} + r_{t+2} + \ldots + r_{H-1}\right] = V^\pi(s_t)$

→ Increase the logprob of an action proportionally to how much its return is better than the expected return under the current policy.

[See Greensmith, Bartlett & Baxter, JMLR 2004 for variance reduction techniques.]
Monte Carlo Estimation of $V^\pi$

$$\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\left(\sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right) - V^\pi\!\left(s_t^{(i)}\right)\right)$$

How to estimate $V^\pi$?
- Init $V^\pi_{\phi_0}$
- Collect trajectories $\tau_1,\ldots,\tau_m$
- Regress against the empirical return:

$$\phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\left(V^\pi_\phi\!\left(s_t^{(i)}\right) - \sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right)\right)^2$$
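As a concrete illustration of this regression (not from the original slides), here is a minimal numpy sketch; the linear value function and the `featurize` helper are assumptions made for the example:

```python
import numpy as np

def returns_to_go(rewards):
    """Empirical return-to-go: out[t] = sum_{k=t}^{H-1} r_k (undiscounted, as on the slide)."""
    return np.cumsum(np.asarray(rewards, dtype=np.float64)[::-1])[::-1]

def fit_value_monte_carlo(trajectories, featurize, reg=1e-3):
    """Regress a linear value function V_phi(s) = phi^T featurize(s) onto Monte Carlo returns.

    trajectories: list of (states, rewards) pairs from roll-outs under the current policy.
    """
    X, y = [], []
    for states, rewards in trajectories:
        targets = returns_to_go(rewards)
        for s, g in zip(states, targets):
            X.append(featurize(s))
            y.append(g)
    X, y = np.stack(X), np.asarray(y)
    # Ridge-regularized least squares keeps the solve well-posed.
    phi = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    return phi
```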
Bootstrap Estimation of $V^\pi$

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_u \pi(u\mid s)\sum_{s'}P(s'\mid s,u)\left[R(s,u,s') + \gamma V^\pi(s')\right]$$

- Init $V^\pi_{\phi_0}$
- Collect data {s, u, s', r}
- Fitted V iteration:

$$\phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s,u,s',r)}\left\|r + \gamma V^\pi_{\phi_i}(s') - V_\phi(s)\right\|_2^2 + \lambda\left\|\phi - \phi_i\right\|_2^2$$
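A matching sketch of one fitted-V iteration (again with an assumed linear value function); the proximal term mirrors the $\lambda\|\phi-\phi_i\|^2$ regularizer on the slide:

```python
import numpy as np

def fitted_v_iteration_step(transitions, phi_i, featurize, gamma=0.99, lam=1e-2):
    """One fitted V iteration: regress V_phi(s) onto the bootstrap target r + gamma * V_{phi_i}(s').

    transitions: iterable of (s, u, s_next, r) tuples; phi_i: current value parameters.
    """
    X, y = [], []
    for s, _u, s_next, r in transitions:
        X.append(featurize(s))
        y.append(r + gamma * featurize(s_next) @ phi_i)  # target uses the *old* parameters phi_i
    X, y = np.stack(X), np.asarray(y)
    # Closed form of  min_phi ||X phi - y||^2 + lam ||phi - phi_i||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * phi_i)
```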
Vanilla Policy Gradient ~ [Williams, 1992]
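The algorithm box on this slide was an image; as a sketch only, here is the standard loop implied by the preceding derivation (reward-to-go with a state-dependent baseline). The `grad_log_prob` and `value_baseline` interfaces are assumptions made for the example:

```python
import numpy as np

def policy_gradient_estimate(trajectories, grad_log_prob, value_baseline=None):
    """Likelihood ratio gradient with reward-to-go and an optional state-dependent baseline.

    trajectories: list of (states, actions, rewards) arrays, one entry per roll-out
    grad_log_prob(s, u): returns grad_theta log pi_theta(u | s) as a flat vector (assumed interface)
    value_baseline(s): returns a scalar baseline b(s), e.g. a fitted V^pi; defaults to 0
    """
    grads = []
    for states, actions, rewards in trajectories:
        rtg = np.cumsum(np.asarray(rewards, dtype=np.float64)[::-1])[::-1]  # sum_{k>=t} r_k
        g = 0.0
        for t, (s, u) in enumerate(zip(states, actions)):
            b = value_baseline(s) if value_baseline is not None else 0.0
            g = g + grad_log_prob(s, u) * (rtg[t] - b)
        grads.append(g)
    return np.mean(grads, axis=0)  # 1/m sum over roll-outs

# Illustrative usage (gradient ascent): theta <- theta + alpha * policy_gradient_estimate(...)
```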
Recall Our Likelihood Ratio PG Estimator

$$\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta\!\left(u_t^{(i)}\mid s_t^{(i)}\right)\left(\sum_{k=t}^{H-1}R\!\left(s_k^{(i)},u_k^{(i)}\right) - V^\pi\!\left(s_t^{(i)}\right)\right)$$

- Estimation of Q from a single roll-out, $Q^\pi(s,u) = E\!\left[r_0 + r_1 + r_2 + \cdots \mid s_0 = s,\ u_0 = u\right]$, is high variance per sample and uses no generalization
- Reduce variance by discounting
- Reduce variance by function approximation (= critic)
Variance Reduction by Discounting

$$Q^\pi(s,u) = E\!\left[r_0 + r_1 + r_2 + \cdots \mid s_0 = s,\ u_0 = u\right]$$

→ introduce a discount factor γ as a hyperparameter to improve the estimate of Q:

$$Q^{\pi,\gamma}(s,u) = E\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\ u_0 = u\right]$$
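A tiny helper (illustrative only) making the discounted single-roll-out estimate concrete; `out[t]` is the sample estimate of $Q^{\pi,\gamma}(s_t, u_t)$:

```python
import numpy as np

def discounted_returns_to_go(rewards, gamma):
    """out[t] = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...  (single-sample Q estimate)."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```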
Reducing Variance by Function Approximation

$$Q^{\pi,\gamma}(s,u) = E\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\ u_0 = u\right]$$
$$= E\!\left[r_0 + \gamma V^\pi(s_1) \mid s_0 = s,\ u_0 = u\right]$$
$$= E\!\left[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s,\ u_0 = u\right]$$
$$= E\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s,\ u_0 = u\right]$$
$$= \cdots$$

Async Advantage Actor Critic (A3C) [Mnih et al, 2016]:
- $\hat{Q}$ = one of the above choices (e.g. k = 5 step lookahead)
Reducing Variance by Function Approximation

$$Q^{\pi,\gamma}(s,u) = E\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\ u_0 = u\right]$$
$$(1-\lambda):\quad E\!\left[r_0 + \gamma V^\pi(s_1) \mid s_0 = s,\ u_0 = u\right]$$
$$(1-\lambda)\lambda:\quad E\!\left[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s,\ u_0 = u\right]$$
$$(1-\lambda)\lambda^2:\quad E\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s,\ u_0 = u\right]$$
$$\vdots$$

Generalized Advantage Estimation (GAE) [Schulman et al, ICLR 2016]:
- $\hat{Q}$ = λ-exponentially weighted average of all of the above (weights shown on each line)
- ~ TD(λ) / eligibility traces [Sutton and Barto, 1990]
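In code, GAE is usually written in terms of TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, whose $(\gamma\lambda)$-discounted sum reproduces the weighted average above [Schulman et al., ICLR 2016]; a minimal sketch:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation [Schulman et al., ICLR 2016].

    rewards: r_0..r_{H-1} from one roll-out
    values:  V(s_0)..V(s_H) from the current critic (length H+1; V(s_H)=0 if terminal)
    Returns A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma V(s_{t+1}) - V(s_t).
    lam=0 recovers the one-step TD advantage, lam=1 the Monte Carlo return minus V(s_t).
    """
    rewards, values = np.asarray(rewards, float), np.asarray(values, float)
    H = len(rewards)
    adv = np.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv  # adv + values[:-1] gives the corresponding Q-hat targets for the critic
```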
Actor-Critic with A3C or GAE

Policy Gradient + Generalized Advantage Estimation:
- Init $\pi_{\theta_0}$, $V^\pi_{\phi_0}$
- Collect roll-outs {s, u, s', r} and $\hat{Q}_i(s,u)$
- Update:

$$\phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s,u,s',r)}\left\|\hat{Q}_i(s,u) - V^\pi_\phi(s)\right\|_2^2 + \kappa\left\|\phi - \phi_i\right\|_2^2$$

$$\theta_{i+1} \leftarrow \theta_i + \alpha\,\frac{1}{m}\sum_{k=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_{\theta_i}\!\left(u_t^{(k)}\mid s_t^{(k)}\right)\left(\hat{Q}_i\!\left(s_t^{(k)},u_t^{(k)}\right) - V^\pi_{\phi_i}\!\left(s_t^{(k)}\right)\right)$$

Note: many variations exist, e.g. one could instead use the 1-step bootstrap for V and the full roll-out for π:

$$\phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s,u,s',r)}\left\|r + V^\pi_{\phi_i}(s') - V_\phi(s)\right\|_2^2 + \lambda\left\|\phi - \phi_i\right\|_2^2$$

$$\theta_{i+1} \leftarrow \theta_i + \alpha\,\frac{1}{m}\sum_{k=1}^{m}\sum_{t=0}^{H-1}\nabla_\theta \log \pi_{\theta_i}\!\left(u_t^{(k)}\mid s_t^{(k)}\right)\left(\sum_{t'=t}^{H-1} r_{t'}^{(k)} - V^\pi_{\phi_i}\!\left(s_t^{(k)}\right)\right)$$
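Putting the pieces together, a sketch of one update in the style of this slide (the callback interfaces are illustrative assumptions, not a library API); it reuses the `gae_advantages` helper sketched above:

```python
import numpy as np

def actor_critic_update(rollouts, grad_log_prob, v_phi, fit_v, theta, alpha,
                        gamma=0.99, lam=0.95):
    """One policy-gradient + GAE update in the spirit of the slide (interfaces are illustrative).

    rollouts: list of (states, actions, rewards) per roll-out
    grad_log_prob(s, u): grad_theta log pi_theta(u|s) under the current policy parameters theta
    v_phi(s): current critic value; fit_v(states, targets): regresses the critic onto Q-hat targets
    """
    all_states, all_targets = [], []
    policy_grad = 0.0
    for states, actions, rewards in rollouts:
        values = np.array([v_phi(s) for s in states] + [0.0])  # V(s_H) = 0 assumed (episodic)
        adv = gae_advantages(rewards, values, gamma, lam)      # helper from the GAE sketch above
        q_hat = adv + values[:-1]                              # Q-hat targets for the critic
        for t, (s, u) in enumerate(zip(states, actions)):
            policy_grad = policy_grad + grad_log_prob(s, u) * (q_hat[t] - values[t])
        all_states.extend(states)
        all_targets.extend(q_hat)
    fit_v(all_states, np.asarray(all_targets))                 # critic update
    return theta + alpha * policy_grad / len(rollouts)         # actor update (gradient ascent)
```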
Async Advantage Actor Critic (A3C)
- [Mnih et al, ICML 2016]
- Likelihood Ratio Policy Gradient
- n-step Advantage Estimation
A3C -- labyrinth
Example: Toddler Robot [Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]
GAE: Effect of gamma and lambda [Schulman et al, 2016 -- GAE]
Step-sizing and Trust Regions
- Step-sizing is necessary because the gradient is only a first-order approximation
What's in a Step-size?
- Terrible step sizes are always an issue, but what about step sizes that are merely not so great?
- Supervised learning:
  - Step too far → the next update will correct for it
- Reinforcement learning:
  - Step too far → terrible policy
  - Next mini-batch: collected under this terrible policy!
  - Not clear how to recover, short of going back and shrinking the step size
Step-sizing and Trust Regions
- Simple step-sizing: line search in the direction of the gradient
  - Simple, but expensive (evaluations along the line)
  - Naïve: ignores where the first-order approximation is good/poor
Step-sizing and Trust Regions
- Advanced step-sizing: trust regions
- The first-order approximation from the gradient is good within a "trust region" → solve for the best point within the trust region:

$$\max_{\delta\theta}\ \hat{g}^\top\delta\theta \quad \text{s.t.}\quad KL\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right) \le \varepsilon$$
Evaluating the KL

Our problem:

$$\max_{\delta\theta}\ \hat{g}^\top\delta\theta \quad \text{s.t.}\quad KL\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right) \le \varepsilon$$

Recall:

$$P(\tau;\theta) = P(s_0)\prod_{t=0}^{H-1}\pi_\theta(u_t\mid s_t)\,P(s_{t+1}\mid s_t,u_t)$$

Hence:

$$KL\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right) = \sum_\tau P(\tau;\theta)\log\frac{P(\tau;\theta)}{P(\tau;\theta+\delta\theta)}$$

$$= \sum_\tau P(\tau;\theta)\log\frac{P(s_0)\prod_{t=0}^{H-1}\pi_\theta(u_t\mid s_t)P(s_{t+1}\mid s_t,u_t)}{P(s_0)\prod_{t=0}^{H-1}\pi_{\theta+\delta\theta}(u_t\mid s_t)P(s_{t+1}\mid s_t,u_t)}$$

$$= \sum_\tau P(\tau;\theta)\log\frac{\prod_{t=0}^{H-1}\pi_\theta(u_t\mid s_t)}{\prod_{t=0}^{H-1}\pi_{\theta+\delta\theta}(u_t\mid s_t)} \qquad \text{(dynamics cancel out!)}$$

$$\approx \frac{1}{M}\sum_{s\ \text{in roll-outs under}\ \theta}\sum_u \pi_\theta(u\mid s)\log\frac{\pi_\theta(u\mid s)}{\pi_{\theta+\delta\theta}(u\mid s)} = \frac{1}{M}\sum_{s\sim\theta}KL\!\left(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta+\delta\theta}(\cdot\mid s)\right)$$

$$\approx \frac{1}{M}\sum_{(s,u)\ \text{in roll-outs under}\ \theta}\log\frac{\pi_\theta(u\mid s)}{\pi_{\theta+\delta\theta}(u\mid s)}$$
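The last approximation on this slide is what gets monitored in practice; as an illustrative one-liner (assuming the log-probabilities of the sampled actions under both policies have been recorded):

```python
import numpy as np

def kl_estimate_from_rollouts(logp_old, logp_new):
    """Sample-based estimate of KL(pi_theta || pi_{theta+dtheta}) from (s, u) pairs collected
    under the old policy: mean of log pi_old(u|s) - log pi_new(u|s)."""
    return float(np.mean(np.asarray(logp_old) - np.asarray(logp_new)))
```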
Evaluating the KL

Our problem:

$$\max_{\delta\theta}\ \hat{g}^\top\delta\theta \quad \text{s.t.}\quad KL\!\left(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\right) \le \varepsilon$$

has become:

$$\max_{\delta\theta}\ \hat{g}^\top\delta\theta \quad \text{s.t.}\quad \frac{1}{M}\sum_{(s,u)\sim\theta}\log\frac{\pi_\theta(u\mid s)}{\pi_{\theta+\delta\theta}(u\mid s)} \le \varepsilon$$

- How do we enforce this constraint with complex policies like neural nets?
- Use a 2nd-order approximation of the KL divergence:
  - (1) The first-order term vanishes (the KL is minimized at δθ = 0)
  - (2) The Hessian is the Fisher information matrix
Evaluating the KL

2nd-order approximation to the KL:

$$KL\!\left(\pi_\theta(u\mid s)\,\|\,\pi_{\theta+\delta\theta}(u\mid s)\right) \approx \delta\theta^\top\left(\sum_{(s,u)\sim\theta}\nabla_\theta\log\pi_\theta(u\mid s)\,\nabla_\theta\log\pi_\theta(u\mid s)^\top\right)\delta\theta = \delta\theta^\top F_\theta\,\delta\theta$$

→ The Fisher matrix $F_\theta$ is easily computed from gradient calculations
Evaluating the KL

Our problem:

$$\max_{\delta\theta}\ \hat{g}^\top\delta\theta \quad \text{s.t.}\quad \delta\theta^\top F_\theta\,\delta\theta \le \varepsilon$$

Done?
- Deep RL → θ is high-dimensional, and building / inverting $F_\theta$ is impractical
- Efficient scheme through conjugate gradient [Schulman et al, 2015, TRPO]

Can we do better?
- Replace the objective by a surrogate loss that is a higher-order approximation yet equally efficient to evaluate [Schulman et al, 2015, TRPO]
- Note: the surrogate loss idea is generally applicable whenever likelihood ratio gradients are used
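To make the resulting update concrete, a small numpy sketch that forms $F_\theta$ from score vectors and solves the constrained problem exactly; the explicit solve is for illustration only, since TRPO's contribution is precisely to avoid building and inverting $F_\theta$ (conjugate gradient) [Schulman et al., 2015]. The damping term is an assumption added for numerical stability:

```python
import numpy as np

def fisher_from_scores(score_vectors):
    """F_theta ~ average outer product of grad_theta log pi_theta(u|s) over sampled (s, u)."""
    S = np.stack(score_vectors)            # shape [N, dim(theta)]
    return S.T @ S / len(S)

def trust_region_step(g_hat, F, epsilon, damping=1e-4):
    """Solve max_{dtheta} g_hat^T dtheta  s.t.  dtheta^T F dtheta <= epsilon.

    Solution: dtheta = sqrt(epsilon / (g^T F^{-1} g)) * F^{-1} g, i.e. the natural gradient
    direction scaled to sit on the trust-region boundary.
    """
    F_damped = F + damping * np.eye(F.shape[0])   # damping keeps the solve well-conditioned
    nat_grad = np.linalg.solve(F_damped, g_hat)   # F^{-1} g
    step_size = np.sqrt(epsilon / max(g_hat @ nat_grad, 1e-12))
    return step_size * nat_grad
```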