

  1. Learning to Optimize as Policy Learning Yisong Yue

  2. Policy Learning (Reinforcement & Imitation) • Goal: Find “Optimal” Policy • Agent observes State/Context s_t and acts; the Environment/World returns s_{t+1} • Imitation Learning: Optimize imitation loss • Reinforcement Learning: Optimize environmental reward • A Learning-based Approach for Sequential Decision Making

  3. Basic Formulation • Agent (Typically a Neural Net) observes State/Context s_t, takes an Action (control), and the Environment/World returns s_{t+1} • Policy: π: s → P(a) • Roll-out: τ = (s_0, a_0, s_1, a_1, s_2, …) (aka trace or trajectory) • Transition Function: P(s’|s, a) • Objective: ∑_t r(s_t, a_t)
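This formulation maps directly onto a roll-out loop. Below is a minimal sketch in Python of rolling out a policy and accumulating the objective ∑_t r(s_t, a_t); the environment interface (reset/step) and the policy callable are illustrative assumptions, not code from the slides.

    def rollout(env, policy, horizon):
        """Roll out a policy for `horizon` steps; return the trace and total reward."""
        s = env.reset()                      # initial state/context s_0
        trace, total_reward = [], 0.0
        for t in range(horizon):
            a = policy(s)                    # policy maps state -> action
            s_next, r = env.step(a)          # transition via P(s'|s, a), reward r(s_t, a_t)
            trace.append((s, a))
            total_reward += r
            s = s_next
        return trace, total_reward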

  4. Optimization as Sequential Decision Making • Many Solvers are Sequential: Tree-Search, Greedy, Gradient Descent • Can view solver as “agent” or “policy” • State = intermediate solution • Find a state with high reward (solution) • Learn better local decision making • Formalize Learning Problem • Builds upon mod… • Theoretical Analysis • Interesting Algorithms

  5. Example #1: Learning to Search (Discrete) • Integer Program → Tree-Search (Branch and Bound) • State = partial search tree (need to featurize) • Action = variable selection or branching • Sparse Reward @ feasible solution [He et al., 2014] [Khalil et al., 2016] [Song et al., arXiv]

  6. Example #1: Learning to Search (Discrete) • Integer Program → Tree-Search (Branch and Bound) • State = partial search tree (need to featurize) • Action = variable selection or branching • Deterministic State Transitions • Massive State Space • Sparse Rewards • Sparse Reward @ feasible solution [He et al., 2014] [Khalil et al., 2016] [Song et al., arXiv]
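To make the "solver as policy" view concrete, here is a rough sketch (not the authors' implementation) of a branch-and-bound loop in which the branching decision is delegated to a learned policy; the node methods (bound, solve_relaxation, is_integral, objective, branch_on) are assumed interfaces for illustration.

    import heapq
    from itertools import count

    def branch_and_bound(root, branching_policy):
        """Best-first branch and bound (maximization); a learned policy picks the branching variable."""
        best_obj, best_sol = float("-inf"), None
        tie = count()                                        # tie-breaker so the heap never compares nodes
        frontier = [(-root.bound(), next(tie), root)]        # max-heap on the relaxation bound
        while frontier:
            neg_bound, _, node = heapq.heappop(frontier)
            if -neg_bound <= best_obj:
                continue                                     # prune: cannot beat the incumbent
            sol = node.solve_relaxation()
            if node.is_integral(sol):                        # feasible solution reached (sparse reward)
                if node.objective(sol) > best_obj:
                    best_obj, best_sol = node.objective(sol), sol
                continue
            var = branching_policy(node)                     # learned action: variable selection
            for child in node.branch_on(var):                # state transition is deterministic
                heapq.heappush(frontier, (-child.bound(), next(tie), child))
        return best_sol, best_obj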

  7. Example #2: Learning Greedy Algorithms (Discrete, Submodular) • Contextual Submodular Maximization: argmax_{|Ψ| ≤ k} F(Ψ) • Greedy Sequential Selection: Ψ ← Ψ ⊕ argmax_a F(Ψ ⊕ a) (greedy oracle not available at test time) • Train policy to mimic greedy: π: s → a • State s = (context/environment, partial selection) • Select from a Dictionary of Trajectories • Learning Policies for Contextual Submodular Prediction, S. Ross, R. Zhou, Y. Yue, D. Dey, J.A. Bagnell, ICML 2013

  8. Example #2: Learning Greedy Algorithms (Discrete, Submodular) • Contextual Submodular Maximization: argmax_{|Ψ| ≤ k} F(Ψ) • Greedy Sequential Selection: Ψ ← Ψ ⊕ argmax_a F(Ψ ⊕ a) (greedy oracle not available at test time) • Train policy to mimic greedy: π: s → a • Deterministic State Transitions • Massive State Space • Dense Rewards • State s = (context/environment, partial selection); select from a Dictionary of Trajectories • Note: not learning the submodular function itself • Learning Policies for Contextual Submodular Prediction, S. Ross, R. Zhou, Y. Yue, D. Dey, J.A. Bagnell, ICML 2013
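As a rough illustration of "mimic the greedy oracle" (a sketch under assumed interfaces, not the paper's code), the training-time greedy selection can be logged as per-state imitation targets for the policy:

    def greedy_demonstrations(F, candidates, budget):
        """Run greedy selection with the (training-time-only) objective F,
        logging (state, action) pairs as imitation targets."""
        selected, demos = [], []
        for _ in range(budget):
            if not candidates:
                break
            # Greedy action: the candidate with the largest marginal gain.
            best = max(candidates, key=lambda a: F(selected + [a]) - F(selected))
            demos.append((list(selected), best))       # state = partial selection, action = greedy pick
            selected.append(best)
            candidates = [c for c in candidates if c != best]
        return selected, demos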

  9. Example #3: Iterative Amortized Inference (Continuous) • Gradient Descent Style Updates • State = description of problem & current estimate • Action = next point • Useful for Accelerating Variational Inference • Iterative Amortized Inference, Joe Marino, Yisong Yue, Stephan Mandt, ICML 2018

  10. Example #3: Iterative Amortized Inference (Continuous) • Gradient Descent Style Updates • State = description of problem & current estimate • Action = next point • (Mostly) Deterministic State Transitions • Continuous State Space • Dense Rewards • Simplest Case: One-Shot Inference, i.e. “Variational Autoencoders” [Kingma & Welling, ICLR 2014] • Useful for Accelerating Variational Inference • Iterative Amortized Inference, Joe Marino, Yisong Yue, Stephan Mandt, ICML 2018
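A minimal sketch of the learned gradient-style update pattern behind this example (illustrative only; the actual iterative amortized inference model operates on variational parameters and their gradients, and `update_net` is an assumed learned network):

    def learned_optimizer(x0, grad_fn, update_net, num_steps):
        """Iteratively refine an estimate, letting a learned network propose each update."""
        x = x0
        for _ in range(num_steps):
            g = grad_fn(x)                 # gradient of the objective at the current estimate
            x = x + update_net(x, g)       # action: the learned network outputs the next point
        return x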

  11. Optimization as Sequential Decision Making • Learning to Search (Jialin Song) • Discrete Optimization (Tree Search), Sparse Rewards • Learning to Search via Retrospective Imitation [arXiv] • Co-training for Policy Learning [UAI 2019] • Contextual Submodular Maximization (Stephane Ross) • Discrete Optimization (Greedy), Dense Rewards • Learning Policies for Contextual Submodular Prediction [ICML 2013] • Learning to Infer (Joe Marino) • Continuous Optimization (Gradient-style), Dense Rewards • Iterative Amortized Inference [ICML 2018] • A General Method for Amortizing Variational Filtering [NeurIPS 2018]


  13. Learning to Optimize for Tree Search • Idea #1: Treat as Standard RL • Randomly explore for high rewards • Very hard exploration problem! • Issues: massive state space & sparse rewards

  14. Learning to Optimize for Tree Search • Idea #2: Treat as Standard IL (Behavioral Cloning) • Convert to Supervised Learning • Assume access to solved instances: “Demonstration Data” • Training Data: D_0 = {(s, a*)} • Basic IL: argmin_{π ∈ Π} L_{D_0}(π) ≡ E_{(s,a) ~ D_0} [ℓ(a, π(s))]
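A hedged sketch of this behavioral-cloning reduction in PyTorch (the dataset format and network are assumptions for illustration, not the slides' code): the policy is fit by plain supervised learning on (state, expert action) pairs.

    import torch
    import torch.nn as nn

    def behavioral_cloning(policy_net, demos, epochs=10, lr=1e-3):
        """Fit the policy to demonstration mini-batches (state_features, expert_action)
        by minimizing the per-state imitation loss, i.e. plain supervised learning."""
        loss_fn = nn.CrossEntropyLoss()                 # l(a, pi(s)) for discrete actions
        opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
        for _ in range(epochs):
            for states, expert_actions in demos:        # demos: iterable of mini-batches
                logits = policy_net(states)             # pi(s): scores over actions
                loss = loss_fn(logits, expert_actions)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return policy_net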

  15. Challenges w/ Imitation Learning • Issues with Behavioral Cloning • Minimize L_{D_0}(π) … implications? • If π makes a mistake early, is the subsequent state distribution ≈ D_0?? • Some extensions to Interactive IL [He et al., NeurIPS 2014] (our approach is also Interactive IL) • Demonstrations not Available on Large Problems • How to (formally) bootstrap from smaller problems? • Bridging the gap between IL & RL (our approach gives one solution)

  16. Retrospective Imitation (Jialin Song) • Given: • Family of Distributions of Search Problems, parameterized by size/difficulty (difficulty levels k = 1, …, K) • Solved Instances on the Smallest/Easiest Instances (“Demonstrations”) • Goal: • Interactive IL approach • Can Scale Up from the Smallest/Easiest Instances (connections to Curriculum & Transfer Learning) • Formal Guarantees • Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv

  17. Retrospective Imitation • Two-Stage Algorithm • Core Algorithm: Interactive IL w/ Sparse Environmental Rewards • Fixed problem difficulty • Reductions to Supervised Learning • Full Algorithm: Scaling Up • Uses Core Algorithm as Subroutine • Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv

  18. Retrospective Imitation (Core Algorithm) • [Figure 1: a visualization of retrospective imitation learning depicting components of Algorithm 1] • Step 1: Initial Learning from the expert trace (imitation via supervised learning) • Step 2: Policy Roll-out (optional exploration) • Step 3: Retrospective Oracle (Algorithm 2) gives feedback on the roll-out trace • Step 4: Policy Update with Further Learning (reduction to supervised learning) • Repeat • Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
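A very rough sketch of the core loop as described on this slide (not the paper's Algorithm 1; roll_out, retrospective_oracle, and fit_supervised are assumed helpers): once a roll-out eventually reaches a feasible solution, the retrospective oracle labels the decisions along the path that led to that solution, and those labels become the imitation targets for the next round of supervised learning.

    def retrospective_imitation_core(policy, instances, expert_demos, num_rounds):
        """Iteratively improve a search policy using retrospective feedback on its own roll-outs."""
        data = list(expert_demos)                     # step 1: initialize from expert traces
        policy = fit_supervised(policy, data)         #         (imitation via supervised learning)
        for _ in range(num_rounds):
            traces = [roll_out(policy, inst) for inst in instances]   # step 2: policy roll-outs
            for trace in traces:
                # Step 3: retrospective oracle labels the decisions on the path
                # that led to the feasible solution found in this trace.
                data.extend(retrospective_oracle(trace))
            policy = fit_supervised(policy, data)     # step 4: policy update (supervised reduction)
        return policy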

  19. Retrospective Imitation (Full Algorithm) • Initialize k = 1; Initialize with a Base Solver (Gurobi/SCIP/CPLEX) to provide demonstrations • Run the Core Algorithm on Difficulty-k Instances & Demonstrations • k = k+1: use the trained policy on the next difficulty level • Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv
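A sketch of the scaling-up loop implied by this slide (illustrative pseudocode; base_solver_demos, generate_instances, and demos_from_policy are assumed helpers): demonstrations come from a base solver only at the easiest level, and afterwards from the policy trained at the previous level.

    def retrospective_imitation_full(policy, max_difficulty, rounds_per_level):
        """Scale up from easy to hard instances, reusing the core algorithm at each level."""
        demos = base_solver_demos(difficulty=1)                   # e.g. from Gurobi/SCIP/CPLEX
        for k in range(1, max_difficulty + 1):
            instances = generate_instances(difficulty=k)
            policy = retrospective_imitation_core(policy, instances, demos, rounds_per_level)
            # Demonstrations for the next difficulty level come from the policy just trained.
            demos = demos_from_policy(policy, generate_instances(difficulty=k + 1))
        return policy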

  20. Core Algorithm • Does this converge? • Converges to what? • Learning to Search via Retrospective Imitation, Jialin Song, Ravi Lanka, et al., arXiv

  21. Imitation Learning Tutorial (ICML 2018) https://sites.google.com/view/icml2018-imitation-learning/ • Yisong Yue: yyue@caltech.edu, @YisongYue, yisongyue.com • Hoang M. Le: hmle@caltech.edu, @HoangMinhLe, hoangle.info

  22. Issues w/ Distribution Drift & Imitation Signal • Demonstrations from initial Solver: D_0 = {(s, a*)}, i.e. the “correct” decision in each state • Which input states? Correct relative to what? • Supervised learning: argmin_{π ∈ Π} L_{D_0}(π) ≡ E_{(s,a) ~ D_0} [ℓ(a, π(s))] (oracle call to TensorFlow/PyTorch/etc.) • If π achieves low error on D_0, so what?

  23. Interactive Imitation Learning (Core Alg) • First popularized by [Daume et al., 2009] [Ross et al., 2011] • Basic idea: • Train π_{i+1} = argmin_{π ∈ Π} L_{D_i}(π) (supervised learning) • Roll out π_{i+1} on instances, collect traces τ • Demonstrator converts τ into per-state feedback D̃_{i+1} (depends on π) • Data aggregation: D_{i+1} = D_i ∪ D̃_{i+1} • i = i+1, repeat • Search-based Structured Prediction, Daume, Langford, Marcu, Machine Learning Journal 2009 • A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross, Gordon, Bagnell, AISTATS 2011
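A compact sketch of this interactive loop (DAgger-style data aggregation; fit_supervised, roll_out, and demonstrator_feedback are assumed helpers, not code from the slides):

    def interactive_imitation(policy, instances, demonstrator_feedback, D0, num_iters):
        """DAgger-style loop: train, roll out, get per-state feedback, aggregate, repeat."""
        D = list(D0)
        for i in range(num_iters):
            policy = fit_supervised(policy, D)                 # pi_{i+1} = argmin loss on D_i
            traces = [roll_out(policy, inst) for inst in instances]
            new_feedback = []
            for trace in traces:                               # per-state corrections from the demonstrator
                new_feedback.extend(demonstrator_feedback(trace))
            D = D + new_feedback                               # data aggregation: D_{i+1} = D_i U D~_{i+1}
        return policy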
