
CS 287 Lecture 18 (Fall 2019) RL I: Policy Gradients
Pieter Abbeel, UC Berkeley EECS
Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics


  1. Likelihood Ratio Policy Gradient [Aleksandrov, Sysoyev, & Shemeneva, 1968] [Rubinstein, 1969] [Glynn, 1986] [REINFORCE, Williams 1992] [GPOMDP, Baxter & Bartlett, 2001]

  2. Likelihood Ratio Gradient: Validity

     \nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

     Valid even when:
     - R is discontinuous and/or unknown
     - the sample space (of paths) is a discrete set
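
     To make the estimator concrete, here is a minimal numpy sketch (not from the slides) for a toy setting where the "paths" are just three discrete outcomes under a softmax distribution. The score-function estimate matches the exact gradient of the expected reward even though R is an arbitrary lookup table: it could be discontinuous, and the sample space is discrete.

```python
# A minimal numpy sketch (not from the slides): score-function gradient for a toy
# "path" distribution with three discrete outcomes. R can be any lookup table
# (even discontinuous), yet the estimate matches the exact gradient of E[R].
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])     # softmax logits over 3 possible "paths"
R = np.array([1.0, -2.0, 3.0])         # arbitrary reward for each path

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

def grad_log_prob(th, tau):
    # d/dtheta log softmax(theta)[tau] = one_hot(tau) - softmax(theta)
    g = -probs(th)
    g[tau] += 1.0
    return g

# Likelihood-ratio estimate: g_hat = (1/m) sum_i grad_theta log P(tau_i; theta) R(tau_i)
m = 50_000
taus = rng.choice(len(theta), size=m, p=probs(theta))
g_hat = np.mean([grad_log_prob(theta, t) * R[t] for t in taus], axis=0)

# Exact gradient of U(theta) = sum_tau P(tau; theta) R(tau), for comparison
p = probs(theta)
g_exact = (np.diag(p) - np.outer(p, p)) @ R

print(g_hat, g_exact)   # the two agree up to Monte Carlo noise
```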

  3. Likelihood Ratio Gradient: Intuition

     \nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

     The gradient tries to:
     - increase the probability of paths with positive R
     - decrease the probability of paths with negative R

     The likelihood ratio changes the probabilities of experienced paths; it does not try to change the paths themselves (in contrast with the path derivative).

  4. Let's Decompose Path into States and Actions

     \nabla_\theta \log P(\tau^{(i)}; \theta)
       = \nabla_\theta \log \Big[ P(s_0^{(i)}) \prod_{t=0}^{H-1} \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \, P(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)}) \Big]
       = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)})

     The initial-state and dynamics terms do not depend on \theta, so their gradients drop out: no dynamics model is needed.

  8. Likelihood Ratio Gradient Estimate

     \hat{g} = \frac{1}{m} \sum_{i=1}^m \Big( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big) R(\tau^{(i)})
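
     In practice this estimate is usually implemented as a "surrogate loss" whose autodiff gradient equals \hat{g}. Below is a hedged PyTorch sketch; `policy_net` and the trajectory dictionary fields (`states`, `actions`, `rewards`) are illustrative assumptions, not the lecture's code.

```python
import torch

def pg_surrogate_loss(policy_net, trajectories):
    """trajectories: list of dicts with tensors 'states' (T, d), 'actions' (T,), 'rewards' (T,)."""
    losses = []
    for traj in trajectories:
        logits = policy_net(traj["states"])                 # (T, num_actions)
        dist = torch.distributions.Categorical(logits=logits)
        logp = dist.log_prob(traj["actions"])               # log pi_theta(u_t | s_t)
        R = traj["rewards"].sum()                           # whole-trajectory return R(tau)
        losses.append(-(logp.sum() * R))                    # grad = -sum_t grad log pi * R(tau)
    return torch.stack(losses).mean()                       # average over the m trajectories

# usage: loss = pg_surrogate_loss(policy_net, trajs); loss.backward(); optimizer.step()
```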

  9. Outline for Today’s Lecture

     - Super-quick refresher: Markov Decision Processes (MDPs), Reinforcement Learning, Policy Optimization
     - Model-free Policy Optimization: Finite Differences
     - Model-free Policy Optimization: Cross-Entropy Method
     - Model-free Policy Optimization: Policy Gradients
       - Policy Gradient standard derivation
       - Temporal decomposition
       - Policy Gradient importance sampling derivation
       - Baseline subtraction
       - Value function estimation
       - Advantage Estimation (A2C/A3C/GAE)
       - Trust Region Policy Optimization (TRPO)
       - Proximal Policy Optimization (PPO)

  10. Derivation from Importance Sampling

     U(\theta) = E_{\tau \sim \theta_{old}} \Big[ \frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \Big]

     \nabla_\theta U(\theta) = E_{\tau \sim \theta_{old}} \Big[ \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta_{old})} R(\tau) \Big]

     \nabla_\theta U(\theta) \big|_{\theta = \theta_{old}} = E_{\tau \sim \theta_{old}} \Big[ \frac{\nabla_\theta P(\tau \mid \theta) \big|_{\theta_{old}}}{P(\tau \mid \theta_{old})} R(\tau) \Big]
       = E_{\tau \sim \theta_{old}} \Big[ \nabla_\theta \log P(\tau \mid \theta) \big|_{\theta_{old}} \, R(\tau) \Big]

     Note: this suggests we can also look at more than just the gradient! E.g., we can use the importance-sampled objective as a "surrogate loss" (locally) [-> later: PPO]. [Tang & Abbeel, NeurIPS 2011]
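
     A hedged sketch of using the importance-sampled objective as a local surrogate (the PPO connection noted on the slide). Same illustrative trajectory format as above, plus an assumed stored per-step log-prob field `old_logp` recorded under \theta_{old}; differentiating at \theta = \theta_{old} recovers the likelihood-ratio gradient.

```python
import torch

def is_surrogate_objective(policy_net, trajectories):
    objs = []
    for traj in trajectories:
        logits = policy_net(traj["states"])
        logp = torch.distributions.Categorical(logits=logits).log_prob(traj["actions"])
        # P(tau | theta) / P(tau | theta_old) = exp(sum_t log pi_theta - sum_t log pi_theta_old)
        ratio = torch.exp(logp.sum() - traj["old_logp"].sum().detach())
        objs.append(ratio * traj["rewards"].sum())
    return torch.stack(objs).mean()     # maximize this locally (ascend its gradient)
```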

  15. Outline for Today’s Lecture (recap of the slide 9 outline; up next: baseline subtraction and temporal structure)

  16. Likelihood Ratio Gradient Estimate

     As formulated thus far: unbiased but very noisy. Fixes that lead to real-world practicality:
     - Baseline
     - Temporal structure
     - [later] Trust region / natural gradient

  17. Likelihood Ratio Gradient: Intuition (recap of slide 3)

  18. Likelihood Ratio Gradient: Baseline

     \nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

     Consider a baseline b:

     \nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) \, (R(\tau^{(i)}) - b)

     Still unbiased, as long as the baseline doesn't depend on the action in logprob(action):

     E[\nabla_\theta \log P(\tau; \theta) \, b]
       = \sum_\tau P(\tau; \theta) \, \nabla_\theta \log P(\tau; \theta) \, b
       = \sum_\tau P(\tau; \theta) \, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \, b
       = \sum_\tau \nabla_\theta P(\tau; \theta) \, b
       = b \, \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \Big) = b \, \nabla_\theta (1) = b \cdot 0 = 0

     [Williams 1992]
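
     A small numpy sketch (same toy setup as the earlier example, not from the slides) showing both claims empirically: subtracting a constant baseline leaves the estimator's mean unchanged, while its per-sample variance can drop substantially when all returns share a large common offset.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])
R = np.array([11.0, 8.0, 13.0])        # all-positive returns: a baseline helps a lot

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

def grad_log_prob(th, tau):
    g = -probs(th)
    g[tau] += 1.0
    return g

def estimate(b, m=20_000):
    taus = rng.choice(3, size=m, p=probs(theta))
    terms = np.array([grad_log_prob(theta, t) * (R[t] - b) for t in taus])
    return terms.mean(axis=0), terms.var(axis=0)

g_no_b, var_no_b = estimate(b=0.0)
g_with_b, var_with_b = estimate(b=probs(theta) @ R)   # b = E[R(tau)] under the current policy
print(g_no_b, g_with_b)        # means agree up to sampling noise (still unbiased)
print(var_no_b, var_with_b)    # per-component variance is much smaller with the baseline
```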

  19. Likelihood Ratio and Temporal Structure

     Current estimate:

     \hat{g} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)}; \theta) \, (R(\tau^{(i)}) - b)
       = \frac{1}{m} \sum_{i=1}^m \Big( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big) \Big( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \Big)
       = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big[ \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) + \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \Big]

     The first inner sum (rewards before time t) doesn't depend on u_t^{(i)}; the baseline is OK if it depends on s_t^{(i)}. Removing terms that don't depend on the current action can lower variance:

     \hat{g} = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b(s_t^{(i)}) \Big)

     [Policy Gradient Theorem: Sutton et al, NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
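
     The inner sum \sum_{k \ge t} R(s_k, u_k) is the "reward to go" from time t. A one-function numpy sketch (not the lecture's code):

```python
import numpy as np

def rewards_to_go(rewards):
    """rewards: shape (H,). Entry t is sum_{k >= t} r_k."""
    return np.cumsum(rewards[::-1])[::-1]

print(rewards_to_go(np.array([1.0, 0.0, 2.0, 3.0])))   # -> [6. 5. 5. 3.]
```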

  20. Baseline Choices

     Good choices for b:
     - Constant baseline: b = E[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^m R(\tau^{(i)})
     - Optimal constant baseline
     - Time-dependent baseline: b_t = \frac{1}{m} \sum_{i=1}^m \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)})
     - State-dependent expected return: b(s_t) = E[r_t + r_{t+1} + r_{t+2} + \ldots + r_{H-1}] = V^\pi(s_t)
       -> increase the logprob of an action proportionally to how much its return is better than the expected return under the current policy

     [See Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]

  21. Outline for Today’s Lecture (recap of the slide 9 outline; up next: value function estimation)

  22. Monte Carlo Estimation of V^\pi

     \hat{g} = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \Big)

     How to estimate V^\pi?
     - Init V^\pi_{\phi_0}
     - Collect trajectories \tau_1, \ldots, \tau_m
     - Regress against the empirical return:

       \phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \Big( V^\pi_\phi(s_t^{(i)}) - \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) \Big)^2
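
     A hedged PyTorch sketch of this Monte Carlo regression, assuming a hypothetical `value_net` and the same illustrative trajectory format as the earlier sketches:

```python
import torch

def empirical_returns(rewards):
    # entry t is sum_{k >= t} r_k (the empirical return from s_t)
    return torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])

def fit_value_mc(value_net, trajectories, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    states = torch.cat([traj["states"] for traj in trajectories])
    targets = torch.cat([empirical_returns(traj["rewards"]) for traj in trajectories])
    for _ in range(epochs):
        opt.zero_grad()
        # regress V_phi(s_t) onto the empirical return from s_t
        loss = ((value_net(states).squeeze(-1) - targets) ** 2).mean()
        loss.backward()
        opt.step()
```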

  23. Bootstrap Estimation of V^\pi

     Bellman equation for V^\pi:

     V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s, u) \, [ R(s, u, s') + \gamma V^\pi(s') ]

     - Init V^\pi_{\phi_0}
     - Collect data {s, u, s', r}
     - Fitted V iteration:

       \phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s, u, s', r)} \| r + \gamma V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2
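
     A hedged PyTorch sketch of one fitted-V iteration, assuming transition tuples batched into illustrative `states`, `next_states`, `rewards` tensors; `gamma` and `lam` play the roles of \gamma and \lambda on the slide.

```python
import torch

def fitted_v_iteration_step(value_net, batch, gamma=0.99, lam=1e-3, epochs=50, lr=1e-3):
    # Freeze phi_i: both the bootstrap targets and the proximal term use the
    # parameters as they were at the start of this iteration.
    prev_params = [p.detach().clone() for p in value_net.parameters()]
    with torch.no_grad():
        targets = batch["rewards"] + gamma * value_net(batch["next_states"]).squeeze(-1)
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        bellman = ((value_net(batch["states"]).squeeze(-1) - targets) ** 2).sum()
        prox = sum(((p - p0) ** 2).sum() for p, p0 in zip(value_net.parameters(), prev_params))
        # || r + gamma V_{phi_i}(s') - V_phi(s) ||^2 + lam * ||phi - phi_i||^2
        (bellman + lam * prox).backward()
        opt.step()
```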

  24. Vanilla Policy Gradient ~ [Williams, 1992]
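
     The slide presents the algorithm as a figure; below is a minimal sketch of the standard vanilla policy gradient loop (collect roll-outs, form reward-to-go advantages against a mean-return baseline, take one likelihood-ratio gradient step). It assumes a Gymnasium-style `env` and a hypothetical categorical `policy_net`, and illustrates the idea rather than reproducing the lecture's box.

```python
import torch

def vanilla_pg_iteration(env, policy_net, optimizer, num_episodes=16, horizon=200):
    all_logp, all_returns = [], []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        logps, rewards = [], []
        for _ in range(horizon):
            logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            logps.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            if terminated or truncated:
                break
        rtg, running = [], 0.0                      # reward to go per timestep
        for r in reversed(rewards):
            running += r
            rtg.append(running)
        rtg.reverse()
        all_logp.append(torch.stack(logps))
        all_returns.append(torch.tensor(rtg, dtype=torch.float32))
    logp = torch.cat(all_logp)
    returns = torch.cat(all_returns)
    advantages = returns - returns.mean()           # constant baseline b = mean return
    loss = -(logp * advantages).mean()              # gradient = likelihood-ratio PG estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```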

  25. Outline for Today’s Lecture (recap of the slide 9 outline; up next: advantage estimation with A2C/A3C/GAE)

  26. Recall Our Likelihood Ratio PG Estimator (further refinements)

     \hat{g} = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \Big( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \Big)

     - Estimation of Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u] from a single roll-out
       = high variance per sample / no generalization used
     - Reduce variance by discounting
     - Reduce variance by function approximation (= critic)

  32. Variance Reduction by Discounting

     Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u]

     -> introduce a discount factor \gamma as a hyperparameter to improve the estimate of Q:

     Q^{\pi, \gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]

  33. Reducing Variance by Function Approximation

     Q^{\pi, \gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]
       = \cdots

     - Generalized Advantage Estimation uses an exponentially weighted average of these
     - ~ TD(\lambda)

  36. Reducing Variance by Function Approximation

     Q^{\pi, \gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]
       = \cdots

     Async Advantage Actor Critic (A3C) [Mnih et al, 2016]:
     - \hat{Q} = one of the above choices (e.g., k = 5 step lookahead)

  37. Reducing Variance by Function Approximation

     Q^{\pi, \gamma}(s, u) = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, u_0 = u]
       = E[r_0 + \gamma V^\pi(s_1) \mid s_0 = s, u_0 = u]                                        [weight (1 - \lambda)]
       = E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2) \mid s_0 = s, u_0 = u]                         [weight (1 - \lambda) \lambda]
       = E[r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V^\pi(s_3) \mid s_0 = s, u_0 = u]          [weight (1 - \lambda) \lambda^2]
       = \cdots                                                                                  [weights (1 - \lambda) \lambda^3, ...]

     Generalized Advantage Estimation (GAE) [Schulman et al, ICLR 2016]:
     - \hat{Q} = \lambda-exponentially weighted average of all of the above
     - ~ TD(\lambda) / eligibility traces [Sutton and Barto, 1990]
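
     A sketch (under the stated assumptions, not the paper's code) of GAE in the form it is usually implemented: a backward recursion over TD residuals, which equals the \lambda-weighted average of the k-step estimators above. `values` is a hypothetical array of critic estimates V(s_0), ..., V(s_H) (use 0 for V(s_H) at a terminal state).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: (H,); values: (H+1,). Returns advantage estimates of shape (H,)."""
    deltas = rewards + gamma * values[1:] - values[:-1]    # TD residuals delta_t
    adv = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running        # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = running
    return adv
```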

  39. Actor-Critic with A3C or GAE

     Policy Gradient + Generalized Advantage Estimation:
     - Init \pi_{\theta_0}, V^\pi_{\phi_0}
     - Collect roll-outs {s, u, s', r} and \hat{Q}_i(s, u)
     - Update:

       \phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s, u, s', r)} \| \hat{Q}_i(s, u) - V^\pi_\phi(s) \|_2^2 + \kappa \| \phi - \phi_i \|_2^2

       \theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \, \big( \hat{Q}_i(s_t^{(k)}, u_t^{(k)}) - V^\pi_{\phi_i}(s_t^{(k)}) \big)

     Note: many variations are possible; e.g., one could instead use a 1-step target for V and the full roll-out for \pi:

       \phi_{i+1} \leftarrow \arg\min_\phi \sum_{(s, u, s', r)} \| r + V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2

       \theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^m \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \, \Big( \sum_{t'=t}^{H-1} r_{t'}^{(k)} - V^\pi_{\phi_i}(s_t^{(k)}) \Big)
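
     A compact hedged sketch of one such actor-critic update. `policy_net`, `value_net`, the optimizers, and the batch fields are illustrative names; `q_hat` holds precomputed n-step or GAE targets, and the slide's proximal penalty \kappa \|\phi - \phi_i\|^2 is omitted here for brevity.

```python
import torch

def actor_critic_update(policy_net, value_net, policy_opt, value_opt, batch):
    states, actions, q_hat = batch["states"], batch["actions"], batch["q_hat"]
    # Critic: regress V_phi(s) onto Q_hat(s, u).
    value_opt.zero_grad()
    ((q_hat - value_net(states).squeeze(-1)) ** 2).mean().backward()
    value_opt.step()
    # Actor: likelihood-ratio step weighted by the advantage Q_hat - V_phi(s).
    policy_opt.zero_grad()
    logp = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
    advantages = (q_hat - value_net(states).squeeze(-1)).detach()
    (-(logp * advantages).mean()).backward()
    policy_opt.step()
```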

  40. Async Advantage Actor Critic (A3C) [Mnih et al, ICML 2016]
     - Likelihood ratio policy gradient
     - n-step advantage estimation

  41. A3C -- labyrinth

  42. Example: Toddler Robot [Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]

  43. GAE: Effect of gamma and lambda [Schulman et al, 2016 -- GAE]

  44. Outline for Today’s Lecture (recap of the slide 9 outline; up next: step-sizing, trust regions, TRPO, and PPO)

  45. Step-sizing and Trust Regions

     Step-sizing is necessary because the gradient is only a first-order approximation.

  46. What’s in a step-size?

     - Terrible step sizes are always an issue, but how about just not-so-great ones?
     - Supervised learning: step too far -> the next update will correct for it
     - Reinforcement learning: step too far -> terrible policy
       - The next mini-batch is collected under this terrible policy!
       - Not clear how to recover, short of going back and shrinking the step size

  47. Step-sizing and Trust Regions

     Simple step-sizing: line search in the direction of the gradient
     - Simple, but expensive (requires evaluations along the line)
     - Naive: ignores where the first-order approximation is good or poor

  48. Step-sizing and Trust Regions

     Advanced step-sizing: trust regions. The first-order approximation from the gradient is good within a "trust region", so solve for the best point within that region:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad KL(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)) \le \epsilon

  49. Evaluating the KL

     Our problem:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad KL(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)) \le \epsilon

     Recall:

     P(\tau; \theta) = P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) \, P(s_{t+1} \mid s_t, u_t)

     Hence:

     KL(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta))
       = \sum_\tau P(\tau; \theta) \log \frac{P(\tau; \theta)}{P(\tau; \theta + \delta\theta)}
       = \sum_\tau P(\tau; \theta) \log \frac{P(s_0) \prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}{P(s_0) \prod_{t=0}^{H-1} \pi_{\theta + \delta\theta}(u_t \mid s_t) P(s_{t+1} \mid s_t, u_t)}
       = \sum_\tau P(\tau; \theta) \log \frac{\prod_{t=0}^{H-1} \pi_\theta(u_t \mid s_t)}{\prod_{t=0}^{H-1} \pi_{\theta + \delta\theta}(u_t \mid s_t)}        (the dynamics cancel out!)
       \approx \frac{1}{M} \sum_{s \text{ in roll-outs under } \theta} \sum_u \pi_\theta(u \mid s) \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta + \delta\theta}(u \mid s)}
       = \frac{1}{M} \sum_{s \sim \theta} KL(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta + \delta\theta}(\cdot \mid s))
       \approx \frac{1}{M} \sum_{(s, u) \text{ in roll-outs under } \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta + \delta\theta}(u \mid s)}
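
     The final sample-based approximation is straightforward to compute from stored roll-outs. A hedged PyTorch sketch, with hypothetical `policy_old` / `policy_new` categorical policy networks; here the sum is averaged per sampled (s, u) pair, which is one choice of the 1/M normalization on the slide.

```python
import torch

def sampled_kl(policy_old, policy_new, states, actions):
    with torch.no_grad():
        logp_old = torch.distributions.Categorical(logits=policy_old(states)).log_prob(actions)
        logp_new = torch.distributions.Categorical(logits=policy_new(states)).log_prob(actions)
    # approx KL(pi_theta || pi_{theta+dtheta}) under theta's state-action visitation
    return (logp_old - logp_new).mean()
```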

  54. Evaluating the KL

     Our problem:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad KL(P(\tau; \theta) \,\|\, P(\tau; \theta + \delta\theta)) \le \epsilon

     has become:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad \frac{1}{M} \sum_{(s, u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta + \delta\theta}(u \mid s)} \le \epsilon

  55. Evaluating the KL

     Our problem:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad \frac{1}{M} \sum_{(s, u) \sim \theta} \log \frac{\pi_\theta(u \mid s)}{\pi_{\theta + \delta\theta}(u \mid s)} \le \epsilon

     How do we enforce this constraint for complex policies such as neural nets? Use a 2nd-order approximation of the KL divergence:
     (1) the first-order term vanishes (the KL is minimized at \delta\theta = 0)
     (2) the Hessian is the Fisher information matrix

  56. Evaluating the KL

     2nd-order approximation to the KL:

     KL(\pi_\theta(u \mid s) \,\|\, \pi_{\theta + \delta\theta}(u \mid s)) \approx \delta\theta^\top \Big( \sum_{(s, u) \sim \theta} \nabla_\theta \log \pi_\theta(u \mid s) \, \nabla_\theta \log \pi_\theta(u \mid s)^\top \Big) \delta\theta = \delta\theta^\top F_\theta \, \delta\theta

     -> The Fisher matrix F_\theta is easily computed from gradient calculations.
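
     A hedged sketch of forming F_\theta explicitly as the average outer product of score vectors, assuming a hypothetical small `policy_net` (accepting a single state) so that materializing the matrix is feasible; for large networks one works with Fisher-vector products instead, as on the next slide.

```python
import torch

def fisher_matrix(policy_net, states, actions):
    params = [p for p in policy_net.parameters() if p.requires_grad]
    n = sum(p.numel() for p in params)
    F = torch.zeros(n, n)
    for s, u in zip(states, actions):
        logp = torch.distributions.Categorical(logits=policy_net(s)).log_prob(u)
        grads = torch.autograd.grad(logp, params)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        F += torch.outer(g, g)              # score outer product for this (s, u) sample
    return F / len(states)
```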

  58. Evaluating the KL

     Our problem has become:

     \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad s.t. \quad \delta\theta^\top F_\theta \, \delta\theta \le \epsilon

     Done?
     - Deep RL -> \theta is high-dimensional, and building / inverting F_\theta is impractical
     - Efficient scheme through conjugate gradient [Schulman et al, 2015, TRPO]

     Can we do better?
     - Replace the objective by a surrogate loss that is a higher-order approximation yet equally efficient to evaluate [Schulman et al, 2015, TRPO]
     - Note: the surrogate loss idea is generally applicable whenever likelihood ratio gradients are used
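
     A numpy sketch of the conjugate-gradient approach: solve F_\theta x = \hat{g} using only matrix-vector products, then scale the step to the trust-region boundary. The explicit `F` below is for clarity only; TRPO evaluates Fisher-vector products without ever forming F. The scaling follows the slide's constraint \delta\theta^\top F \delta\theta \le \epsilon (the usual TRPO derivation carries an extra factor 1/2).

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x = 0 initially)
    p = g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step_direction(F, g, eps=0.01):
    x = conjugate_gradient(lambda v: F @ v, g)     # x approx F^{-1} g_hat
    scale = np.sqrt(eps / (x @ F @ x))             # puts the step on dtheta^T F dtheta = eps
    return scale * x
```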
