

  1. The Peculiar Optimization and Regularization Challenges in Multi-Task Learning and Meta-Learning. Chelsea Finn, Stanford.

  2. [Figure: training data (paintings by Braque and Cézanne) and a test datapoint.] By Braque or Cézanne?

  3. How did you accomplish this? Through previous experience.

  4. How might you get a machine to accomplish this task? Modeling image formation, geometry. Fewer human priors, more data-driven priors: SIFT features, HOG features + SVM. Greater success: fine-tuning from ImageNet features, domain adaptation from other painters. ??? Can we explicitly learn priors from previous experience that lead to efficient downstream learning? Can we learn to learn?

  5. Outline 1. Brief overview of meta-learning 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away) 3. Can we scale meta-learning to broad task distributions?

  6. How does meta-learning work? An example. Given 1 example of each of 5 classes: classify new examples. [Figure: training data and test set.]

  7. How does meta-learning work? An example. [Figure: meta-training classes across tasks T_1, …, T_n; meta-testing task T_test, each with its own training data and test set.] Given 1 example of each of 5 classes: classify new examples.

  8. How does meta-learning work? One approach: parameterize the learner by a neural network, y^ts = f(D^tr, x^ts; θ). (Hochreiter et al. '91, Santoro et al. '16, many others)
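The black-box approach on this slide can be sketched in a few lines. A minimal, hypothetical sketch (toy numpy model with random weights; `black_box_meta_learner` and the shapes are illustrative, not the talk's architecture): the support set D^tr is encoded by a permutation-invariant network whose pooled output conditions the prediction for x^ts, so adaptation is just a forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_meta_learner(D_tr, x_ts, theta):
    """y_ts = f(D_tr, x_ts; theta): the learner is a single forward pass.

    D_tr: list of (x, y) support pairs; theta: all meta-learned weights.
    Adaptation happens inside the network's activations -- no gradient
    steps at meta-test time.
    """
    W_enc, W_out = theta
    # Encode each support example (features concatenated with its one-hot label).
    support = np.stack([np.concatenate([x, y]) for x, y in D_tr])
    context = np.tanh(support @ W_enc).mean(axis=0)  # permutation-invariant pooling
    # Predict for the query input, conditioned on the pooled context.
    h = np.concatenate([x_ts, context])
    logits = h @ W_out
    return int(logits.argmax())

# Toy 5-way, 1-shot episode with 4-dim features (shapes are illustrative).
dim, n_way, hidden = 4, 5, 8
theta = (rng.normal(size=(dim + n_way, hidden)),
         rng.normal(size=(dim + hidden, n_way)))
D_tr = [(rng.normal(size=dim), np.eye(n_way)[c]) for c in range(n_way)]
pred = black_box_meta_learner(D_tr, rng.normal(size=dim), theta)
```

With random weights the prediction is arbitrary; the point is the interface: all task-specific information must flow through D^tr into the activations.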

  9. How does meta-learning work? Another approach: embed optimization inside the learning process, y^ts = f(D^tr, x^ts; θ), where f contains a gradient step ∇_θ L. (Maclaurin et al. '15, Finn et al. '17, many others)
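The optimization-embedded approach can be sketched with a toy where the inner gradient is available in closed form. A hedged sketch assuming a linear-regression task with squared error (`maml_inner_step` and the shapes are illustrative): the adapted parameters come from one gradient step θ' = θ − α∇_θ L(θ, D^tr); a full MAML outer loop (not shown) would backpropagate through this step to update the initialization θ.

```python
import numpy as np

def maml_inner_step(theta, D_tr, alpha=0.1):
    """One inner-loop step on the support set: theta' = theta - alpha * grad L.

    Toy model y = x @ theta with mean squared error, so the gradient is
    closed-form. In real MAML the outer loop differentiates through this
    update to learn a good initialization theta.
    """
    X, y = D_tr
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    return theta - alpha * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # 5 support examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
theta = np.zeros(3)                    # meta-learned initialization (here: zeros)
theta_prime = maml_inner_step(theta, (X, y))

def support_loss(w):
    return float(np.mean((X @ w - y) ** 2))
```

One step with a modest learning rate already reduces the support-set loss, which is exactly the "fast adaptation" the meta-learner is trained to make possible.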

  10. Can we learn a representation under which RL is fast and efficient? [Figure: locomotion behavior after 1 gradient step from MAML training, for the backward-reward and forward-reward tasks.] Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML '17

  11. Can we learn a representation under which imitation is fast and efficient? [Figure: input demo (via teleoperation) and resulting policy, executed in real time, on a subset of training objects and on held-out test objects.] Finn*, Yu*, Zhang, Abbeel, Levine. One-Shot Visual Imitation Learning via Meta-Learning. CoRL '17

  12. The Bayesian perspective: meta-learning ↔ learning priors from data, p(φ | θ). (Grant et al. '18, Gordon et al. '18, many others)

  13. Outline 1. Brief overview of meta-learning 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away) 3. Can we scale meta-learning to broad task distributions?

  14. How we construct tasks for meta-learning. [Figure: per-task support sets D^tr and query inputs x^ts, with different label assignments across tasks T_1, …, T_test.] Randomly assign class labels to image classes for each task → tasks are mutually exclusive. Algorithms must use the training data to infer the label ordering.

  15. What if label order is consistent? [Figure: per-task support sets D^tr and query inputs x^ts, with the same class-to-label assignment across tasks.] Tasks are non-mutually exclusive: a single function can solve all tasks. The network can simply learn to classify inputs, irrespective of D^tr.
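The difference between the two constructions is just where the label permutation happens. A toy sketch (the `make_episode` helper and data layout are hypothetical, not from the talk's codebase): with `shuffle_labels=True` each task gets its own class-to-label assignment, making tasks mutually exclusive; with `False` a single fixed assignment makes tasks non-mutually exclusive, so one classifier could solve them all without reading D^tr.

```python
import numpy as np

def make_episode(class_pool, n_way=5, k_shot=1, shuffle_labels=True, rng=None):
    """Sample an N-way, K-shot episode from a pool of classes.

    shuffle_labels=True  -> per-task random class->label map (mutually exclusive).
    shuffle_labels=False -> one fixed class->label map (non-mutually exclusive).
    class_pool maps class id -> list of examples (here: toy string ids).
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(class_pool), size=n_way, replace=False)
    labels = rng.permutation(n_way) if shuffle_labels else np.arange(n_way)
    support = []
    for c, lab in zip(classes, labels):
        examples = rng.choice(class_pool[c], size=k_shot, replace=False)
        support += [(x, int(lab)) for x in examples]
    return support

pool = {c: [f"img_{c}_{i}" for i in range(10)] for c in range(20)}
rng = np.random.default_rng(0)
ep_exclusive = make_episode(pool, shuffle_labels=True, rng=rng)
ep_non_exclusive = make_episode(pool, shuffle_labels=False, rng=rng)
```

In the non-exclusive case the label of an image never changes across tasks, which is precisely what lets the network memorize the input-to-label map and ignore the support set.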

  16. The network can simply learn to classify inputs, irrespective of D^tr. [Figure: the learned f maps x^ts directly to a label, bypassing both D^tr and the gradient ∇_θ L.]

  17. What if label order is consistent? [Figure: episodes with a consistent label ordering.] For new image classes at T_test: the model can't make predictions without using D^tr.

  18. Is this a problem? - No: for image classification, we can just shuffle labels*. - No, if we see the same image classes as training (and don't need to adapt at meta-test time). - But yes, if we want to be able to adapt with data for new tasks.

  19. Another example: meta-training tasks T_1 … T_50 ("hammer", "close drawer", "stack", …); meta-test task T_test ("close box"). If you tell the robot the task goal, the robot can ignore the trials. T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL '19

  20. Another example Model can memorize the canonical orientations of the training objects. Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization . ICLR ‘19

  21. Can we do something about it?

  22. If tasks are mutually exclusive: a single function cannot solve all tasks (e.g., due to label shuffling or hiding information). If tasks are non-mutually exclusive: a single function can solve all tasks, and y^ts = f_θ(D^tr_i, x^ts) has multiple solutions to the meta-learning problem. One solution: θ memorizes the canonical pose info and ignores D^tr_i. Another solution: θ carries no info about canonical pose; it is acquired from D^tr_i. There is an entire spectrum of solutions based on how information flows. Suggests a potential approach: control information flow. Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR '19

  23. If tasks are non-mutually exclusive: a single function can solve all tasks, and y^ts = f_θ(D^tr_i, x^ts) has multiple solutions to the meta-learning problem. One solution: θ memorizes the canonical pose info and ignores D^tr_i. Another solution: θ carries no info about canonical pose; it is acquired from D^tr_i. There is an entire spectrum of solutions based on how information flows. One option: max I(ŷ^ts; D^tr | x^ts). Meta-regularization: minimize meta-training loss + information in θ, i.e. L(θ, D_meta-train) + β D_KL(q(θ; θ_μ, θ_σ) ‖ p(θ)). Places precedence on using information from D^tr over storing info in θ. Can combine with your favorite meta-learning algorithm. Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR '19
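The meta-regularization objective admits a very small sketch, because the KL divergence between a diagonal Gaussian q(θ; θ_μ, θ_σ) and a standard-normal prior is available in closed form. The names and the β value below are illustrative; this is a sketch of the objective's shape, not the paper's implementation.

```python
import numpy as np

def gaussian_kl(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2 * log_sigma)

def meta_regularized_loss(meta_train_loss, theta_mu, theta_log_sigma, beta=1e-3):
    """Sketch of the MR objective: task loss + beta * D_KL(q(theta) || p(theta)).

    Penalizing the KL limits how many bits theta can store about the
    meta-training tasks, pushing the model to read task-specific
    information from D_tr instead of memorizing it in theta.
    """
    return meta_train_loss + beta * gaussian_kl(theta_mu, theta_log_sigma)

mu = np.zeros(10)
log_sigma = np.zeros(10)       # sigma = 1 -> q equals the prior, so KL = 0
kl_at_prior = gaussian_kl(mu, log_sigma)
total = meta_regularized_loss(0.7, mu + 0.5, log_sigma)
```

When q sits exactly on the prior the penalty vanishes; any information stored in θ_μ or θ_σ shows up as extra loss, scaled by β.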

  24. Omniglot without label shuffling: "non-mutually-exclusive" Omniglot. [Results on the pose prediction task.] (And it's not just as simple as standard regularization.) TAML: Jamal & Qi. Task-Agnostic Meta-Learning for Few-Shot Learning. CVPR '19. Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR '19

  25. Does meta-regularization lead to better generalization? Let P(θ) be an arbitrary distribution over θ that doesn't depend on the meta-training data (e.g., P(θ) = N(θ; 0, I)). For MAML, with probability at least 1 − δ, for all θ_μ, θ_σ: generalization error ≤ error on the meta-training set + a meta-regularization term. With a Taylor expansion of the RHS and a particular value of β → recover the MR-MAML objective. Proof: draws heavily on Amit & Meir '18. Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR '19
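For reference, a generic McAllester-style PAC-Bayes bound of the flavor the slide invokes (the exact constants and the per-task structure in Yin et al. and in Amit & Meir differ from this textbook form) looks like:

```latex
% Generic PAC-Bayes bound over n samples, holding with probability >= 1 - delta;
% constants vary across statements, so treat this as the shape, not the theorem.
\mathbb{E}_{\theta \sim q}\big[\mathrm{er}(\theta)\big]
\;\le\;
\mathbb{E}_{\theta \sim q}\big[\widehat{\mathrm{er}}(\theta)\big]
\;+\;
\sqrt{\frac{D_{\mathrm{KL}}\!\left(q(\theta;\theta_\mu,\theta_\sigma)\,\|\,P(\theta)\right) + \log\frac{n}{\delta}}{2(n-1)}}
```

Taylor-expanding the square-root term for small KL and absorbing the constants into a coefficient β yields an objective of the form "empirical loss + β · D_KL(q ‖ P)", which is how an MR-style training objective can be read off from a bound of this shape.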

  26. Intermediate Takeaways (2. A peculiar yet ubiquitous problem in meta-learning, and how we might regularize it away). Meta overfitting: memorize the training functions f_i corresponding to tasks in your meta-training dataset; standard overfitting: memorize the training datapoints (x_i, y_i) in your training dataset. Meta regularization: controls information flow, regularizing the description length of the meta-parameters; standard regularization: regularizes the hypothesis class (though not always for DNNs). Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR '19

  27. Outline 1. Brief overview of meta-learning 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away) 3. Can we scale meta-learning to broad task distributions?

  28. Has meta-learning accomplished our goal of making adaptation fast? Sort of… Can adapt to: - new objects - new goal velocities - new object categories Can we adapt to entirely new tasks or datasets?

  29. Can we adapt to entirely new tasks or datasets? → Need a broad distribution of tasks for meta-training: meta-train task distribution = meta-test task distribution. Can we look to RL benchmarks? Brockman et al. OpenAI Gym. 2016. Bellemare et al. Atari Learning Environment. 2016. Fan et al. SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. CoRL 2018.

  30. Our desiderata for the Meta-World Benchmark: 50+ qualitatively distinct tasks; shaped reward functions & success metrics; all tasks individually solvable (to allow us to focus on the multi-task / meta-RL component); unified state & action space and environment (to facilitate transfer). T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL '19

  31. Results: Meta-learning algorithms seem to struggle… …even on the 45 meta-training tasks! Multi-task RL algorithms also struggle… T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World . CoRL ‘19

  32. Why the poor results? An exploration challenge? All tasks are individually solvable. Data scarcity? All methods are given a budget with plenty of samples. Limited model capacity? All methods have plenty of capacity, and training models independently performs the best. Our conclusion: it must be an optimization challenge.

  33. Prior literature on multi-task learning. Architectural solutions: multi-head architectures; Cross-Stitch Networks (Misra, Shrivastava, Gupta, Hebert '16); Deep Relation Networks (Long, Wang '15); Sluice Networks (Ruder, Bingel, Augenstein, Sogaard '17); Multi-Task Attention Network (Liu, Johns, Davison '18); FiLM: Visual Reasoning with a General Conditioning Layer (Perez et al. '17). Task weighting solutions: GradNorm (Chen et al. '18); MT Learning as Multi-Objective Optimization (Sener & Koltun '19).
