
Generative Adversarial Imitation Learning - PowerPoint PPT Presentation

Generative Adversarial Imitation Learning. Stefano Ermon, joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, Hongyu Ren, and Jiaming Song.


  1. Generative Adversarial Imitation Learning. Stefano Ermon, joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, Hongyu Ren, and Jiaming Song

  2. Reinforcement Learning • Goal: learn policies that map high-dimensional, raw observations to actions

  3. Reinforcement Learning • MDP: model for (stochastic) sequential decision-making problems • States S • Actions A • Cost function (immediate): c: S×A → R • Transition probabilities: P(s'|s,a) • Policy: mapping from states to actions, e.g., (S_0 → a_1, S_1 → a_0, S_2 → a_0) • Reinforcement learning: minimize the total (expected, discounted) cost E[ Σ_{t=0}^{T} γ^t c(s_t, a_t) ]
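
To make the objective concrete, here is a minimal sketch of estimating the expected discounted cost by Monte Carlo rollouts. It assumes a gym-style environment interface (`env.reset()` / `env.step()`, with cost taken as the negative reward) and a `policy(state)` callable; these names are placeholders, not part of the original slides.

```python
import numpy as np

def expected_discounted_cost(env, policy, gamma=0.99, horizon=200, n_rollouts=50):
    """Monte Carlo estimate of E_pi[ sum_t gamma^t c(s_t, a_t) ].

    The immediate cost is taken as the negative environment reward, so
    minimizing total cost is the same as maximizing total reward.
    """
    totals = []
    for _ in range(n_rollouts):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)                     # pi: S -> A
            state, reward, done, _ = env.step(action)  # transition P(s'|s,a)
            total += discount * (-reward)              # c(s,a) = -reward
            discount *= gamma
            if done:
                break
        totals.append(total)
    return float(np.mean(totals))
```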

  4. Reinforcement Learning • Pipeline: cost function c(s,a) → Reinforcement Learning (RL) → optimal policy π, interacting with the environment (MDP) • Policy: mapping from states to actions, e.g., (S_0 → a_1, S_1 → a_0, S_2 → a_0) • Cost: c: S×A → R • RL needs a cost signal

  5. Imitation • Input: expert behavior generated by π_E • Goal: learn a cost function (reward) or a policy • (Ng and Russell, 2000), (Abbeel and Ng, 2004; Syed and Schapire, 2007), (Ratliff et al., 2006), (Ziebart et al., 2008), (Kolter et al., 2008), (Finn et al., 2016), etc.

  6. Behavioral Cloning • (state, action) pairs from the expert → supervised learning (regression) → policy • Small errors compound over time (cascading errors) • Decisions are purposeful (require planning)
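
A minimal behavioral-cloning sketch of the pipeline above, assuming the expert (state, action) pairs are already stored in the NumPy files named below; the file names and the choice of regressor are illustrative, not from the slides.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Expert demonstrations: one row per visited state, with the matching expert action.
states = np.load("expert_states.npy")    # shape (N, state_dim)  -- assumed file
actions = np.load("expert_actions.npy")  # shape (N, action_dim) -- assumed file

# Behavioral cloning = plain supervised regression from states to actions.
policy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
policy.fit(states, actions)

# At test time the cloned policy is queried state by state; small prediction errors
# push the agent into states unseen in the data, where the errors compound.
def act(state):
    return policy.predict(state.reshape(1, -1))[0]
```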

  7. Inverse RL • An approach to imitation • Learns a cost c under which the expert behaves (near-)optimally, i.e., incurs lower expected cost than any other policy
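
In symbols, the standard condition this slide alludes to (e.g., as in Ng and Russell, 2000): find a cost c such that

```latex
\mathbb{E}_{\pi_E}\!\left[\textstyle\sum_{t} \gamma^{t} c(s_t, a_t)\right]
\;\le\;
\mathbb{E}_{\pi}\!\left[\textstyle\sum_{t} \gamma^{t} c(s_t, a_t)\right]
\qquad \text{for all policies } \pi .
```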

  8. Problem setup • Forward problem: cost function c(s) → Reinforcement Learning (RL) → optimal policy π in the environment (MDP) • Inverse problem: expert's trajectories s_0, s_1, s_2, … → Inverse Reinforcement Learning (IRL) → cost function c(s) under which the expert has small cost and everything else has high cost (Ziebart et al., 2010; Rust, 1987)

  9. Problem setup • Cost function c(s) → Reinforcement Learning (RL) → optimal policy π; is it ≈ the expert's (similar w.r.t. ψ)? • Expert's trajectories s_0, s_1, s_2, … → Inverse Reinforcement Learning (IRL) → cost function c(s) • ψ: convex cost regularizer

  10. Combining RL ∘ IRL • ρ_π = occupancy measure = distribution of state-action pairs encountered when navigating the environment with the policy π • ρ_{π_E} = expert's occupancy measure • Expert's trajectories s_0, s_1, s_2, … → ψ-regularized Inverse Reinforcement Learning (IRL) • Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* (the convex conjugate of ψ)
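
Written out, this is the characterization from Ho and Ermon (2016), where H(π) is the policy's causal entropy (it appears because the RL step is entropy-regularized):

```latex
\mathrm{RL} \circ \mathrm{IRL}_{\psi}(\pi_E)
\;=\;
\operatorname*{arg\,min}_{\pi}\; -H(\pi) \;+\; \psi^{*}\!\left(\rho_{\pi} - \rho_{\pi_E}\right)
```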

  11. Takeaway • Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* • Typical IRL definition: finding a cost function c such that the expert policy is uniquely optimal w.r.t. c • Alternative view: IRL as a procedure that tries to induce a policy that matches the expert's occupancy measure (a generative model)

  12. Special cases • If ψ(c) = constant, then RL ∘ IRL recovers a policy whose occupancy measure exactly matches the expert's – Not a useful algorithm: in practice, we only have sampled trajectories • Overfitting: too much flexibility in choosing the cost function (and the policy) • [Diagram: the set of all cost functions, with ψ(c) = constant everywhere]

  13. Towards apprenticeship learning • Solution: use features f(s,a) • Cost c(s,a) = θ · f(s,a) • Only these "simple" cost functions are allowed: ψ(c) = 0 for costs linear in the features, ψ(c) = ∞ otherwise • [Diagram: linear-in-features costs as a small subset of all cost functions]

  14. Apprenticeship learning • For that choice of ψ, the RL ∘ IRL_ψ framework gives apprenticeship learning • Apprenticeship learning: find π performing better than π_E over costs linear in the features – Abbeel and Ng (2004) – Syed and Schapire (2007)

  15. Apprenticeship learning • Given: expert demonstrations and a class of cost functions C • Goal: find π performing better than π_E over every cost in C (the expert's expected cost is approximated using the demonstrations)
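
The objective behind these two bullets, in the standard form used later to connect with GAIL (the expert term is estimated from the demonstrations):

```latex
\min_{\pi}\; \max_{c \,\in\, \mathcal{C}}\;
\mathbb{E}_{\pi}\!\left[c(s,a)\right] \;-\; \mathbb{E}_{\pi_E}\!\left[c(s,a)\right]
```

That is, find a policy that does at least as well as the expert under the worst-case cost in the class C.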

  16. Issues with apprenticeship learning • Need to craft features very carefully – unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that AL will recover the expert policy • RL ∘ IRL_ψ(π_E) is "encoding" the expert behavior as a cost function in C – it might not be possible to decode it back if C is too simple • [Diagram: the set of all cost functions, with the RL and IRL maps between policies π and π_E]

  17. Generative Adversarial Imitation Learning • Solution: use a more expressive class of cost functions • [Diagram: linear-in-features costs as a small subset of all cost functions]

  18. Generative Adversarial Imitation Learning • Choose ψ so that ψ* = the optimal negative log-loss of the binary classification problem of distinguishing between state-action pairs of π and π_E • [Diagram: a discriminator D fed with samples from the policy π and the expert policy π_E]
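
With that choice of regularizer, the composed problem becomes the GAN-style saddle point of Ho and Ermon (2016), where D is a binary classifier over state-action pairs and λ weights an entropy bonus:

```latex
\min_{\pi}\; \max_{D}\;
\mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
+ \mathbb{E}_{\pi_E}\!\left[\log\!\left(1 - D(s,a)\right)\right]
- \lambda H(\pi)
```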

  19. Generative Adversarial Networks • Figure from Goodfellow et al., 2014

  20. GAIL • A differentiable discriminator D tries to output 1 on samples from the model and 0 on samples from the expert • Samples from the model are produced by the generator G: the policy acting through a black-box simulator (the environment); in a standard GAN the generator is itself a differentiable function • Ho and Ermon, Generative Adversarial Imitation Learning
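
A skeletal sketch of the resulting alternating updates. The helpers `collect_rollouts`, `sample_expert_batch`, and `policy_gradient_step` (standing in for the TRPO update used in the paper), as well as `env`, `policy`, and the dimensions, are placeholders to be supplied by the user; only the discriminator part is concrete PyTorch.

```python
import torch
import torch.nn as nn

state_dim, action_dim, n_iters = 8, 2, 500   # example sizes (placeholders)

# Discriminator over (state, action) pairs; outputs the logit of "came from the policy".
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

for it in range(n_iters):
    # 1) Roll out the current policy in the black-box environment.
    pi_sa = collect_rollouts(env, policy)      # placeholder: (B, state_dim + action_dim) tensor
    exp_sa = sample_expert_batch()             # placeholder: expert (s, a) pairs, same shape

    # 2) Discriminator step: push D -> 1 on policy samples and D -> 0 on expert samples.
    d_loss = bce(disc(pi_sa), torch.ones(pi_sa.shape[0], 1)) \
           + bce(disc(exp_sa), torch.zeros(exp_sa.shape[0], 1))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # 3) Policy step: use log D(s, a) as the cost signal (low when D thinks the sample
    #    looks like the expert) and take a policy-gradient / TRPO step on it.
    with torch.no_grad():
        costs = torch.log(torch.sigmoid(disc(pi_sa)) + 1e-8).squeeze(-1)
    policy_gradient_step(policy, pi_sa, costs)  # placeholder for the TRPO update
```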

  21. How to optimize the objective • Previous apprenticeship learning work required: – a full dynamics model – a small environment – repeated RL • We propose: gradient descent over the policy parameters (and the discriminator) • J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. ICML 2016.

  22. Properties • Inherits pros of policy gradient – Convergence to local minima – Can be model free • Inherits cons of policy gradient – High variance – Small steps required

  23. Properties • Inherits pros of policy gradient – Convergence to local minima – Can be model free • Inherits cons of policy gradient – High variance – Small steps required • Solution: trust region policy optimization

  24. Results

  25. Results • Input: driving demonstrations (TORCS) • Output: a policy from raw visual inputs • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

  26. Experimental results

  27. Latent structure in demonstrations • Human model: latent variables Z → policy (interacting with the environment) → observed behavior • Is there semantically meaningful latent structure?

  28. InfoGAIL • Observed data → infer the latent structure (Hou et al.) • Maximize mutual information between the latent variables Z and the observed behavior • Latent variables Z → policy (environment) → observed behavior

  29. InfoGAIL • Latent code Z → policy (environment) → observed behavior: state-action pairs (s,a) • Maximize the mutual information between the latent code and the observed behavior
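
The objective sketched on these two slides, as written in Li et al. (2017) up to notation: the GAIL saddle point plus a variational lower bound L_I on the mutual information between the latent code z and the generated behavior, estimated with an auxiliary posterior network Q(z | trajectory):

```latex
\min_{\pi, Q}\; \max_{D}\;
\mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
+ \mathbb{E}_{\pi_E}\!\left[\log\!\left(1 - D(s,a)\right)\right]
- \lambda_{1}\, L_{I}(\pi, Q)
- \lambda_{2}\, H(\pi)
```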

  30. Synthetic experiment • Panels: demonstrations, GAIL, InfoGAIL

  31. InfoGAIL • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations • Latent variables Z → policy (environment) → trajectories • Pass right (z=1) vs. pass left (z=0)

  32. InfoGAIL • Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations • Latent variables Z → policy (environment) → trajectories • Turn outside (z=1) vs. turn inside (z=0)

  33. Multi-agent environments What are the goals of these 4 agents?

  34. Problem setup • Cost functions c_1(s,a_1), …, c_N(s,a_N) → Multi-Agent Reinforcement Learning (MARL) → optimal policies π_1, …, π_K in the environment (a Markov game) • [Example: a 2×2 matrix game with actions L/R and payoffs (0,0) and (10,10)]

  35. Problem setup • Cost functions c_1(s,a_1), …, c_N(s,a_N) → Multi-Agent Reinforcement Learning (MARL) → optimal policies π; are they ≈ the experts' (similar w.r.t. ψ)? Environment: a Markov game • Experts' trajectories (s_0, a_0^1, …, a_0^N), (s_1, a_1^1, …, a_1^N), … → Multi-Agent Inverse Reinforcement Learning (MAIRL) → cost functions c_1(s,a_1), …, c_N(s,a_N)

  36. MAGAIL • One differentiable discriminator per agent, D_1, …, D_N, each trained to distinguish samples from the model from samples from the expert (outputting 1 on one and 0 on the other) • Samples are joint state-action tuples (s, a_1, a_2, …, a_N) • Generator G: the agents' policies (agent 1 through agent N) acting in a black-box simulator • Song, Ren, Sadigh, Ermon, Multi-Agent Generative Adversarial Imitation Learning

  37. Environments • Demonstrations vs. MAGAIL

  38. Environments • Demonstrations vs. MAGAIL

  39. Suboptimal demos • MAGAIL • Expert: lighter plank + bumps on ground

  40. Conclusions • IRL is a dual of an occupancy measure matching problem (generative modeling) • Might need flexible cost functions – GAN-style approach • Policy gradient approach – scales to high-dimensional settings • Towards unsupervised learning of latent structure from demonstrations
