Deep Reinforcement Learning through Policy Optimization

Deep Reinforcement Learning through Policy Optimization
Pieter Abbeel, John Schulman
OpenAI / Berkeley AI Research Lab

Reinforcement Learning [Figure source: Sutton & Barto, 1998]


  1.–3. Derivation from Importance Sampling

  $U(\theta) = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[ \frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{\text{old}})}\, R(\tau) \right]$

  $\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta_{\text{old}})}\, R(\tau) \right]$

  $\nabla_\theta U(\theta)\big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[ \frac{\nabla_\theta P(\tau \mid \theta)\big|_{\theta_{\text{old}}}}{P(\tau \mid \theta_{\text{old}})}\, R(\tau) \right] = \mathbb{E}_{\tau \sim \theta_{\text{old}}}\left[ \nabla_\theta \log P(\tau \mid \theta)\big|_{\theta_{\text{old}}}\, R(\tau) \right]$

  Note: Suggests we can also look at more than just the gradient! E.g., can use the importance-sampled objective as a "surrogate loss" (locally). [Tang & Abbeel, NIPS 2011]

  4. Likelihood Ratio Gradient: Validity

  $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$

  Valid even if R is discontinuous and/or unknown, or the sample space (of paths) is a discrete set.

  5. Likelihood Ratio Gradient: Intuition

  $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$

  The gradient tries to:
  - Increase the probability of paths with positive R
  - Decrease the probability of paths with negative R

  The likelihood ratio changes the probabilities of experienced paths; it does not try to change the paths themselves (see Pathwise Derivatives later).

  6.–9. Let's Decompose the Path into States and Actions

  $P(\tau^{(i)}; \theta) = p(s_0^{(i)}) \prod_{t=0}^{H-1} \pi_\theta(u_t^{(i)} \mid s_t^{(i)})\, P(s_{t+1}^{(i)} \mid s_t^{(i)}, u_t^{(i)})$

  The initial-state and dynamics terms do not depend on $\theta$, so they drop out of the gradient:

  $\nabla_\theta \log P(\tau^{(i)}; \theta) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)})$

  10. Likelihood Ratio Gradient Estimate

  $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \right) R(\tau^{(i)})$

  11. Likelihood Ratio Gradient Estimate

  - As formulated thus far: unbiased but very noisy
  - Fixes that lead to real-world practicality:
    - Baseline
    - Temporal structure
    - Also: KL-divergence trust region / natural gradient (a general trick, equally applicable to perturbation analysis and finite differences)

  12. Likelihood Ratio Gradient: Baseline

  $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$

  To build intuition, let's assume R > 0. Then the gradient tries to increase the probabilities of all paths.

  Consider a baseline b (still unbiased [Williams 1992]):

  $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, \big(R(\tau^{(i)}) - b\big)$

  $\mathbb{E}\left[ \nabla_\theta \log P(\tau; \theta)\, b \right] = \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, b = \sum_\tau P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\, b = \sum_\tau \nabla_\theta P(\tau; \theta)\, b = \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \Big)\, b = \nabla_\theta(1)\, b = 0$

  Good choices for b?  $b = \mathbb{E}[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})$

  [See Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]

  13. Likelihood Ratio and Temporal Structure

  Current estimate:

  $\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, \big(R(\tau^{(i)}) - b\big) = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \right) \left( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \right)$

  Future actions do not depend on past rewards, hence we can lower the variance by instead using:

  $\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b(s_t^{(i)}) \right)$

  Good choice for b? The expected return: $b(s_t) = \mathbb{E}[r_t + r_{t+1} + r_{t+2} + \ldots + r_{H-1}]$

  → Increase the logprob of an action proportionally to how much its returns are better than the expected return under the current policy.

  [Policy Gradient Theorem: Sutton et al., NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
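  As a concrete illustration of the lower-variance term above, here is a minimal numpy sketch that turns one trajectory's rewards into per-timestep "returns-to-go minus baseline" advantages (the function name and calling convention are illustrative, not from the slides):

```python
import numpy as np

def reward_to_go_advantages(rewards, values):
    """Per-timestep return-to-go minus a state-value baseline.

    rewards: shape (H,) with r_0, ..., r_{H-1} for one trajectory.
    values:  shape (H,) with b(s_t) ~= E[r_t + r_{t+1} + ...].
    Returns A_t = sum_{k>=t} r_k - b(s_t).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.cumsum(rewards[::-1])[::-1]   # sum_{k=t}^{H-1} r_k
    return returns - np.asarray(values, dtype=np.float64)

# Example: a 3-step trajectory
adv = reward_to_go_advantages([1.0, 0.0, 2.0], values=[2.5, 1.8, 1.0])
# returns-to-go = [3, 2, 2], so adv = [0.5, 0.2, 1.0]
```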

  14. Pseudo-code: REINFORCE, a.k.a. Vanilla Policy Gradient [Williams, 1992]
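  A minimal PyTorch sketch of the corresponding update for a categorical policy, assuming rollouts (states, actions, advantages) collected with the current `policy` network; all names are placeholders for what the pseudocode describes:

```python
import torch

def vanilla_pg_loss(logits, actions, advantages):
    """Surrogate loss whose gradient is the likelihood-ratio estimator:
    mean over timesteps of -log pi(a_t|s_t) * A_t, with A_t treated as a constant."""
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(actions)
    return -(logp * advantages.detach()).mean()

def reinforce_step(policy, optimizer, states, actions, advantages):
    """One update: policy maps a batch of states to action logits."""
    loss = vanilla_pg_loss(policy(states), actions, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```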

  15. Outline

  - Derivative-free methods: Cross-Entropy Method (CEM) / Finite Differences / Fixing Random Seed
  - Likelihood Ratio (LR) Policy Gradient: Derivation / Connection w/ Importance Sampling
  - Natural Gradient / Trust Regions (-> TRPO)
  - Variance Reduction using Value Functions (Actor-Critic) (-> GAE, A3C)
  - Pathwise Derivatives (PD) (-> DPG, DDPG, SVG)
  - Stochastic Computation Graphs (generalizes LR / PD)
  - Guided Policy Search (GPS)
  - Inverse Reinforcement Learning

  16. Trust Region Policy Optimization

  17. Desiderata

  Desiderata for a policy optimization method:
  - Stable, monotonic improvement. (How to choose stepsizes?)
  - Good sample efficiency

  18. Step Sizes

  Why are step sizes a big deal in RL?
  - Supervised learning
    - Step too far → next updates will fix it
  - Reinforcement learning
    - Step too far → bad policy
    - Next batch: collected under bad policy
    - Can't recover, collapse in performance!

  19. Surrogate Objective

  - Let η(π) denote the expected return of π
  - We collect data with π_old. Want to optimize some objective to get a new policy π
  - Define L_{π_old}(π) to be the "surrogate objective" [Kakade & Langford, 2002]:

    $L_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\pi_{\text{old}}}(s, a) \right]$

    $\nabla_\theta L_{\pi_{\text{old}}}(\pi_\theta)\big|_{\theta_{\text{old}}} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta_{\text{old}}}$  (policy gradient)

  - Local approximation to the performance of the policy; does not depend on the parameterization of π

  S. Kakade and J. Langford. "Approximately optimal approximate reinforcement learning". In: ICML. Vol. 2. 2002, pp. 267–274.

  20. Improvement Theory

  - Theory: bound the difference between $L_{\pi_{\text{old}}}(\pi)$ and $\eta(\pi)$, the performance of the policy
  - Result: $\eta(\pi) \geq L_{\pi_{\text{old}}}(\pi) - C \cdot \max_s \mathrm{KL}[\pi_{\text{old}}(\cdot \mid s), \pi(\cdot \mid s)]$, where $C = 2\epsilon\gamma / (1 - \gamma)^2$
  - Monotonic improvement guaranteed (MM algorithm)

  21. Practical Algorithm: TRPO

  - Constrained optimization problem:

    $\max_\pi L(\pi) \quad \text{subject to} \quad \mathrm{KL}[\pi_{\text{old}}, \pi] \leq \delta$, where $L(\pi) = \mathbb{E}_{\pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\pi_{\text{old}}}(s, a) \right]$

  - Construct the loss from empirical data:

    $\hat{L}(\pi) = \sum_{n=1}^{N} \frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)}\, \hat{A}_n$

  - Make a quadratic approximation and solve with the conjugate gradient algorithm

  J. Schulman, S. Levine, P. Moritz, et al. "Trust Region Policy Optimization". In: ICML. 2015.

  22. Practical Algorithm: TRPO

  for iteration = 1, 2, ... do
      Run policy for T timesteps or N trajectories
      Estimate advantage function at all timesteps
      Compute policy gradient g
      Use CG (with Hessian-vector products) to compute F^{-1} g
      Do line search on surrogate loss and KL constraint
  end for

  J. Schulman, S. Levine, P. Moritz, et al. "Trust Region Policy Optimization". In: ICML. 2015.
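  The "Use CG (with Hessian-vector products) to compute F^{-1} g" step can be sketched as plain conjugate gradient that only needs a Fisher-vector-product callable; `fvp` here is an assumed black box (e.g., built from KL Hessian-vector products), not part of the slides:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only products fvp(v) = F v,
    as used for the TRPO step direction F^{-1} g."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()
    rr = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p.dot(Fp) + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r.dot(r)
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

  In practice the returned direction x is then rescaled by sqrt(2δ / xᵀFx) so the quadratic KL estimate matches the trust-region radius δ, before the line search in the pseudocode.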

  23. Practical Algorithm: TRPO

  Applied to:
  - Locomotion controllers in 2D
  - Atari games with pixel input

  J. Schulman, S. Levine, P. Moritz, et al. "Trust Region Policy Optimization". In: ICML. 2015.

  24.–26. "Proximal" Policy Optimization

  - Use a penalty instead of a constraint:

    $\max_\theta \; \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)}\, \hat{A}_n - \beta\, \mathrm{KL}[\pi_{\theta_{\text{old}}}, \pi_\theta]$

  - Pseudocode:

    for iteration = 1, 2, ... do
        Run policy for T timesteps or N trajectories
        Estimate advantage function at all timesteps
        Do SGD on the above objective for some number of epochs
        If KL too high, increase β. If KL too low, decrease β.
    end for

  - ≈ same performance as TRPO, but only first-order optimization
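  A minimal PyTorch sketch of this penalty objective for a categorical policy; `old_logits` are the logits stored at rollout time, the returned KL is what drives the β adjustment in the pseudocode, and all names are illustrative rather than from the slides:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_penalty_loss(new_logits, old_logits, actions, advantages, beta):
    """Negative of the penalty objective: surrogate advantage minus beta * KL[pi_old, pi_theta]."""
    new_dist = Categorical(logits=new_logits)
    old_dist = Categorical(logits=old_logits.detach())
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages.detach()).mean()
    kl = kl_divergence(old_dist, new_dist).mean()
    return -(surrogate - beta * kl), kl

# After several epochs of SGD on this loss, adapt the penalty, e.g.:
#   if kl > 1.5 * kl_target: beta *= 2
#   elif kl < kl_target / 1.5: beta /= 2
```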

  27. Variance Reduction Using Value Functions

  28. Variance Reduction

  - Now we have the following policy gradient formula:

    $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^\pi(s_t, a_t) \right]$

  - $A^\pi$ is not known, but we can plug in $\hat{A}_t$, an advantage estimator
  - Previously, we showed that taking $\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \ldots - b(s_t)$, for any function $b(s_t)$, gives an unbiased policy gradient estimator. $b(s_t) \approx V^\pi(s_t)$ gives variance reduction.

  29. The Delayed Reward Problem

  - With policy gradient methods, we are confounding the effect of multiple actions: $\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \ldots - b(s_t)$ mixes the effects of $a_t, a_{t+1}, a_{t+2}, \ldots$
  - SNR of $\hat{A}_t$ scales roughly as 1/T
  - Only $a_t$ contributes to the signal $A^\pi(s_t, a_t)$, but $a_{t+1}, a_{t+2}, \ldots$ contribute to noise.

  30. Variance Reduction with Discounts

  - Discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future (ignore long-term dependencies)
  - We can form an advantage estimator using the discounted return:

    $\hat{A}_t^\gamma = \underbrace{r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots}_{\text{discounted return}} - \; b(s_t)$

    which reduces to our previous estimator when γ = 1.
  - So that the advantage has expectation zero, we should fit the baseline to be the discounted value function:

    $V^{\pi,\gamma}(s) = \mathbb{E}_\tau\left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0 = s \right]$

  - Discount γ is similar to using a horizon of 1/(1 − γ) timesteps
  - $\hat{A}_t^\gamma$ is a biased estimator of the advantage function

  31. Value Functions in the Future

  - The baseline accounts for and removes the effect of past actions
  - Can also use the value function to estimate future rewards:

    $r_t + \gamma V(s_{t+1})$  (cut off at one timestep)
    $r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$  (cut off at two timesteps)
    ...
    $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$  (∞ timesteps, no V)

  32. Value Functions in the Future

  - Subtracting out baselines, we get advantage estimators:

    $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
    $\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
    ...
    $\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots - V(s_t)$

  - $\hat{A}_t^{(1)}$ has low variance but high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias.
  - Using an intermediate k (say, 20) gives an intermediate amount of bias and variance

  33. Finite-Horizon Methods: Advantage Actor-Critic

  - A2C / A3C uses this fixed-horizon advantage estimator

  V. Mnih, A. P. Badia, M. Mirza, et al. "Asynchronous Methods for Deep Reinforcement Learning". In: ICML. 2016.

  34. Finite-Horizon Methods: Advantage Actor-Critic

  - A2C / A3C uses this fixed-horizon advantage estimator
  - Pseudocode:

    for iteration = 1, 2, ... do
        Agent acts for T timesteps (e.g., T = 20)
        For each timestep t, compute
            $\hat{R}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T)$
            $\hat{A}_t = \hat{R}_t - V(s_t)$
        $\hat{R}_t$ is the target value in a regression problem, $\hat{A}_t$ is the estimated advantage
        Compute the loss gradient $g = \nabla_\theta \sum_{t=1}^{T} \left[ -\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + c\, (V(s_t) - \hat{R}_t)^2 \right]$
        g is plugged into a stochastic gradient descent variant, e.g., Adam.
    end for

  V. Mnih, A. P. Badia, M. Mirza, et al. "Asynchronous Methods for Deep Reinforcement Learning". In: ICML. 2016.
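  A small numpy sketch of the R̂_t / Â_t computation in the pseudocode, done as a single backward pass over a T-step segment (the function name and the `bootstrap_value` argument standing in for V(s_T) are illustrative):

```python
import numpy as np

def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Fixed-horizon (A2C-style) targets for a T-step rollout segment.

    rewards: (T,) rewards r_0..r_{T-1};  values: (T,) V(s_0)..V(s_{T-1});
    bootstrap_value: V(s_T) from the critic at the final state.
    R_t = r_t + gamma*r_{t+1} + ... + gamma^{T-t-1}*r_{T-1} + gamma^{T-t}*V(s_T)
    """
    T = len(rewards)
    returns = np.empty(T)
    running = bootstrap_value
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - np.asarray(values)
    return returns, advantages
```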

  35. A3C Video

  36. A3C Results

  37. TD(λ) Methods: Generalized Advantage Estimation

  - Recall the finite-horizon advantage estimators:
    $\hat{A}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)$
  - Define the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
  - By a telescoping sum, $\hat{A}_t^{(k)} = \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{k-1} \delta_{t+k-1}$
  - Take an exponentially weighted average of the finite-horizon estimators:
    $\hat{A}_t^\lambda = (1 - \lambda)\left( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \ldots \right)$
  - We obtain $\hat{A}_t^\lambda = \delta_t + (\gamma\lambda)\, \delta_{t+1} + (\gamma\lambda)^2\, \delta_{t+2} + \ldots$
  - This scheme is named generalized advantage estimation (GAE) in [1], though versions have appeared earlier, e.g., [2]. Related to TD(λ).

  [1] J. Schulman, P. Moritz, S. Levine, et al. "High-Dimensional Continuous Control Using Generalized Advantage Estimation". In: ICLR. 2016.
  [2] H. Kimura and S. Kobayashi. "An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function". In: ICML. 1998, pp. 278–286.
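  The last identity suggests the usual backward recursion for computing Â_t^λ over a rollout segment; a minimal numpy sketch (names are illustrative, and `bootstrap_value` stands in for V at the segment's final state):

```python
import numpy as np

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation:
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),  A_t = delta_t + gamma*lam*A_{t+1}."""
    values = np.append(values, bootstrap_value)   # V(s_0)..V(s_T)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```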

  38. Choosing parameters γ, λ: performance as γ, λ are varied

  39. TRPO+GAE Video

  40. Pathwise Derivative Policy Gradient Methods

  41.–43. Deriving the Policy Gradient, Reparameterized

  - Episodic MDP:
    [Computation graph: θ; states s_1, s_2, ..., s_T; actions a_1, a_2, ..., a_T; return R_T]
    Want to compute $\nabla_\theta \mathbb{E}[R_T]$. We'll use $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$.
  - Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution.
    [Computation graph: as above, with noise nodes z_1, z_2, ..., z_T feeding the actions]
  - Only works if $P(s_2 \mid s_1, a_1)$ is known :(

  44. Using a Q-function

  [Computation graph: θ; states s_1, ..., s_T; actions a_1, ..., a_T; noise z_1, ..., z_T; return R_T]

  $\frac{d}{d\theta} \mathbb{E}[R_T] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d R_T}{d a_t} \frac{d a_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d\, \mathbb{E}[R_T \mid a_t]}{d a_t} \frac{d a_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d\, Q(s_t, a_t)}{d a_t} \frac{d a_t}{d\theta} \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \frac{d}{d\theta} Q(s_t, \pi(s_t, z_t; \theta)) \right]$
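  A minimal PyTorch sketch of the right-most expression: differentiate Q(s_t, π(s_t, z_t; θ)) with respect to θ through the action. Both `q_func` and `policy` are assumed differentiable modules, and only the policy parameters should be stepped with the resulting gradient:

```python
import torch

def pathwise_policy_loss(q_func, policy, states, noise):
    """Pathwise / reparameterized policy objective:
    maximize sum_t Q(s_t, pi(s_t, z_t; theta)), gradients flowing through the action."""
    actions = policy(states, noise)        # a_t = pi(s_t, z_t; theta), differentiable in theta
    q_values = q_func(states, actions)     # dQ/da is backpropagated into theta
    return -q_values.sum()                 # minimize the negative => ascend the objective
```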

  45.–46. SVG(0) Algorithm

  - Learn $Q_\phi$ to approximate $Q^{\pi,\gamma}$, and use it to compute gradient estimates.
  - Pseudocode:

    for iteration = 1, 2, ... do
        Execute policy π_θ to collect T timesteps of data
        Update π_θ using $g \propto \nabla_\theta \sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))$
        Update Q_φ using $g \propto \nabla_\phi \sum_{t=1}^{T} (Q_\phi(s_t, a_t) - \hat{Q}_t)^2$, e.g. with TD(λ)
    end for

  N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS. 2015.

  47. SVG(1) Algorithm

  [Computation graph: θ; states s_1, ..., s_T; actions a_1, ..., a_T; noise z_1, ..., z_T; return R_T]

  - Instead of learning Q, we learn:
    - A state-value function $V \approx V^{\pi,\gamma}$
    - A dynamics model f, approximating $s_{t+1} = f(s_t, a_t) + \zeta_t$
  - Given a transition $(s_t, a_t, s_{t+1})$, infer $\zeta_t = s_{t+1} - f(s_t, a_t)$
  - $Q(s_t, a_t) = \mathbb{E}[r_t + \gamma V(s_{t+1})] = \mathbb{E}[r_t + \gamma V(f(s_t, a_t) + \zeta_t)]$, and $a_t = \pi(s_t, \theta, \zeta_t)$

  48. SVG(∞) Algorithm

  [Computation graph: θ; states s_1, ..., s_T; actions a_1, ..., a_T; noise z_1, ..., z_T; return R_T]

  - Just learn the dynamics model f
  - Given the whole trajectory, infer all noise variables
  - Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph

  49. SVG Results

  - Applied to 2D robotics tasks
  - Overall: different gradient estimators behave similarly

  N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS. 2015.

  50. Deterministic Policy Gradient

  - For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the variance goes to zero
  - But the SVG(0) gradient is fine when σ → 0:
    $\nabla_\theta \sum_t Q(s_t, \pi(s_t, \theta, \zeta_t))$
  - Problem: there's no exploration.
  - Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
  - The policy gradient is a little biased (even with $Q = Q^\pi$), but only because the state distribution is off; it gets the right gradient at every state

  D. Silver, G. Lever, N. Heess, et al. "Deterministic policy gradient algorithms". In: ICML. 2014.

  51.–53. Deep Deterministic Policy Gradient

  - Incorporate replay buffer and target network ideas from DQN for increased stability
  - Use lagged (Polyak-averaged) versions of Q_φ and π_θ for fitting Q_φ (towards $Q^{\pi,\gamma}$) with TD(0):
    $\hat{Q}_t = r_t + \gamma Q_{\phi'}(s_{t+1}, \pi(s_{t+1}; \theta'))$
  - Pseudocode:

    for iteration = 1, 2, ... do
        Act for several timesteps, add data to the replay buffer
        Sample a minibatch
        Update π_θ using $g \propto \nabla_\theta \sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))$
        Update Q_φ using $g \propto \nabla_\phi \sum_{t=1}^{T} (Q_\phi(s_t, a_t) - \hat{Q}_t)^2$
    end for

  T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. "Continuous control with deep reinforcement learning". In: ICLR. 2016.
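  A minimal PyTorch sketch of one DDPG minibatch update consistent with the pseudocode above, assuming `policy`/`q` modules, their lagged target copies, a replay-buffer minibatch of float tensors, and a soft-update rate τ (all names and hyperparameters are placeholders):

```python
import torch
import torch.nn.functional as F

def ddpg_update(policy, q, target_policy, target_q, pi_opt, q_opt, batch,
                gamma=0.99, tau=0.005):
    """One DDPG minibatch update: TD(0) critic target from lagged networks,
    pathwise actor update, then Polyak averaging of the targets."""
    s, a, r, s_next, done = batch            # r, done: shape (B, 1) float tensors

    # Critic: Q_hat_t = r_t + gamma * Q'(s_{t+1}, pi'(s_{t+1}))
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q(s_next, target_policy(s_next))
    q_loss = F.mse_loss(q(s, a), target)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor: maximize Q(s, pi(s)) via the pathwise derivative
    pi_loss = -q(s, policy(s)).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    # Polyak-average the lagged (target) networks
    with torch.no_grad():
        for p, p_t in zip(policy.parameters(), target_policy.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p, p_t in zip(q.parameters(), target_q.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```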

  54. DDPG Results

  Applied to 2D and 3D robotics tasks and driving with pixel input

  T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. "Continuous control with deep reinforcement learning". In: ICLR. 2016.

  55. Policy Gradient Methods: Comparison

  - Two kinds of policy gradient estimator:
    - REINFORCE / score function estimator: $\nabla \log \pi(a \mid s)\, \hat{A}$
      - Learn Q or V for variance reduction, to estimate $\hat{A}$
    - Pathwise derivative estimators (differentiate w.r.t. the action):
      - SVG(0) / DPG: $\frac{d}{da} Q(s, a)$  (learn Q)
      - SVG(1): $\frac{d}{da} (r + \gamma V(s'))$  (learn f, V)
      - SVG(∞): $\frac{d}{da_t} (r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots)$  (learn f)
  - Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias

  56. Policy Gradient Methods: Comparison

  Y. Duan, X. Chen, R. Houthooft, et al. "Benchmarking Deep Reinforcement Learning for Continuous Control". In: ICML. 2016.

  57. Stochastic Computation Graphs

  58. Gradients of Expectations

  Want to compute $\nabla_\theta \mathbb{E}[F]$. Where's θ?
  - In the distribution, e.g., $\mathbb{E}_{x \sim p(\cdot \mid \theta)}[F(x)]$
    - $\nabla_\theta \mathbb{E}_x[F(x)] = \mathbb{E}_x[F(x)\, \nabla_\theta \log p(x; \theta)]$
    - Score function estimator
    - Example: REINFORCE policy gradients, where x is the trajectory
  - Outside the distribution: $\mathbb{E}_{z \sim N(0,1)}[F(\theta, z)]$
    - $\nabla_\theta \mathbb{E}_z[F(x(z, \theta))] = \mathbb{E}_z[\nabla_\theta F(x(z, \theta))]$
    - Pathwise derivative estimator
    - Example: SVG policy gradient
  - Often we can reparametrize, to change from one form to another
  - What if F depends on θ in a complicated way, affecting both the distribution and F?

  M. C. Fu. "Gradient estimation". In: Handbooks in Operations Research and Management Science 13 (2006), pp. 575–616.
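  A toy Monte-Carlo comparison of the two estimators (not from the slides) on $\nabla_\theta \mathbb{E}_{x \sim N(\theta,1)}[x^2] = 2\theta$, using the reparameterization x = θ + z:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000
z = rng.standard_normal(n)
x = theta + z                      # x ~ N(theta, 1), reparameterized

f = x ** 2                         # F(x) = x^2, true gradient d/dtheta E[F] = 2*theta = 3.0

score_fn_grad = np.mean(f * (x - theta))   # E[F(x) * d/dtheta log N(x; theta, 1)]
pathwise_grad = np.mean(2 * x)             # E[d/dtheta (theta + z)^2]

print(score_fn_grad, pathwise_grad)        # both ~3.0; the pathwise estimate has much lower variance
```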

  59. Stochastic Computation Graphs

  - A stochastic computation graph is a DAG; each node corresponds to a deterministic or stochastic operation
  - Can automatically derive unbiased gradient estimators, with variance reduction

  [Figure: a deterministic computation graph vs. a stochastic computation graph with a stochastic node feeding the loss L]

  J. Schulman, N. Heess, T. Weber, et al. "Gradient Estimation Using Stochastic Computation Graphs". In: NIPS. 2015.

  60. Worked Example

  [Graph: parameters θ and φ; deterministic nodes a, c, e; stochastic nodes b, d]

  - L = c + e. Want to compute $\frac{d}{d\theta} \mathbb{E}[L]$ and $\frac{d}{d\phi} \mathbb{E}[L]$.
  - Treat the stochastic nodes (b, d) as constants, and introduce a loss logprob × (future cost) at each stochastic node
  - Obtain an unbiased gradient estimate by differentiating the surrogate:

    $\text{Surrogate}(\theta, \phi) = \underbrace{c + e}_{(1)} + \underbrace{\log p(\hat{b} \mid a, d)\, \hat{c}}_{(2)} + (\text{analogous logprob term for } d)$

    (1): how the parameters influence the cost through deterministic dependencies
    (2): how the parameters affect the distribution over random variables
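  A PyTorch sketch of this recipe on a made-up graph of the same shape: two stochastic nodes whose log-probs multiply their detached downstream costs, plus a direct deterministic path to the cost. The specific dependencies are assumptions for illustration, not the slide's exact graph:

```python
import torch
from torch.distributions import Normal

def surrogate(theta, phi):
    """Surrogate loss for a tiny stochastic computation graph (hypothetical):
      a = theta^2 deterministic, b ~ N(a, 1) stochastic, c = (b - theta)^2;
      d ~ N(phi, 1) stochastic, e = d^2.  Total cost L = c + e.
    Differentiating this gives unbiased estimates of dE[L]/dtheta and dE[L]/dphi."""
    a = theta ** 2
    b_dist = Normal(a, 1.0)
    b = b_dist.sample()                 # stochastic node, treated as a constant
    c = (b - theta) ** 2                # also depends on theta deterministically

    d_dist = Normal(phi, 1.0)
    d = d_dist.sample()                 # second stochastic node
    e = d ** 2

    # (1) deterministic part + (2) logprob * (detached downstream cost) per stochastic node
    return (c + e
            + b_dist.log_prob(b) * c.detach()
            + d_dist.log_prob(d) * e.detach())

theta = torch.tensor(0.5, requires_grad=True)
phi = torch.tensor(-0.3, requires_grad=True)
surrogate(theta, phi).backward()
print(theta.grad, phi.grad)             # single-sample unbiased gradient estimates
```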

  61. Outline

  - Derivative-free methods: Cross-Entropy Method (CEM) / Finite Differences / Fixing Random Seed
  - Likelihood Ratio (LR) Policy Gradient: Derivation / Connection w/ Importance Sampling
  - Natural Gradient / Trust Regions (-> TRPO)
  - Variance Reduction using Value Functions (Actor-Critic) (-> GAE, A3C)
  - Pathwise Derivatives (PD) (-> DPG, DDPG, SVG)
  - Stochastic Computation Graphs (generalizes LR / PD)
  - Guided Policy Search (GPS)
  - Inverse Reinforcement Learning

  62. Goal

  - Find a parameterized policy $\pi_\theta(u_t \mid x_t)$ that optimizes

    $J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{\pi_\theta(x_t, u_t)}\left[ l(x_t, u_t) \right]$

  - Notation:

    $\pi_\theta(\tau) = p(x_1) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, u_t)\, \pi_\theta(u_t \mid x_t)$,   $\tau = \{x_1, u_1, \ldots, x_T, u_T\}$

  - RL takes lots of data… Can we reduce it to supervised learning?

  63. Naïve Solution

  - Step 1:
    - Consider sampled problem instances i = 1, 2, ..., I
    - Find a trajectory-centric controller $\pi_i(u_t \mid x_t)$ for each problem instance
  - Step 2:
    - Supervised training of a neural net to match all $\pi_i(u_t \mid x_t)$:
      $\pi_\theta \leftarrow \arg\min_\theta \sum_i D_{\mathrm{KL}}(p_i(\tau)\, \|\, \pi_\theta(\tau))$
  - ISSUES:
    - Compounding error (Ross, Gordon, Bagnell, JMLR 2011, "DAgger")
    - Mismatch train vs. test (e.g., blind peg, vision, …)
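  Since samples from p_i(τ) turn the KL objective into maximum likelihood of the teacher actions at visited states, step 2 can be sketched as ordinary supervised regression. This PyTorch snippet assumes a Gaussian policy head returning (mean, log_std); the names are illustrative, not from the slides:

```python
import torch

def distill_step(policy, optimizer, states, teacher_actions):
    """Supervised "step 2" of the naive approach (sketch): fit pi_theta to the
    actions of the per-instance controllers pi_i by maximizing their likelihood."""
    mean, log_std = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    loss = -dist.log_prob(teacher_actions).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```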

  64. (Generic) Guided Policy Search

  - Optimization formulation: the particular form of the constraint varies depending on the specific method:
    - Dual gradient descent: Levine and Abbeel, NIPS 2014
    - Penalty methods: Mordatch, Lowrey, Andrew, Popovic, Todorov, NIPS 2016
    - ADMM: Mordatch and Todorov, RSS 2014
    - Bregman ADMM: Levine, Finn, Darrell, Abbeel, JMLR 2016
    - Mirror descent: Montgomery, Levine, NIPS 2016

  65. [Levine & Abbeel, NIPS 2014]

  66. [Levine & Abbeel, NIPS 2014]

  67. Comparison [Levine, Wagener, Abbeel, ICRA 2015]

  68. Block Stacking – Learning the Controller for a Single Instance [Levine, Wagener, Abbeel, ICRA 2015]
